Adaptive Migratory Scheme for Distributed Shared Memory
Jai-Hoon Kim and Nitin H. Vaidya
Department of Computer Science, Texas A&M University


Adaptive Migratory Scheme for Distributed Shared Memory 1

Jai-Hoon Kim and Nitin H. Vaidya
Department of Computer Science
Texas A&M University
College Station, TX
{jhkim,vaidya}@cs.tamu.edu

Technical Report, November 1996

Abstract

This paper presents an adaptive migratory scheme for software Distributed Shared Memory (DSM). Under migratory sharing, the message that sends a copy of a page to a remote node (on which a page fault has occurred) is shortly followed by an invalidation request from that remote node. Our adaptive migratory scheme eliminates this invalidation overhead by self-invalidating the local copy when a copy of the page is sent. Each node can independently detect a migratory memory access pattern and self-invalidate its local copy of a page using local information only. Experimental results show that performance is improved by dynamically selecting the migratory protocol.

Keywords: distributed shared memory, release memory consistency, adaptive protocol, migratory protocol, competitive update protocol, performance evaluation, cost analysis model.

1 This work is supported in part by the National Science Foundation under grant MIP.

1 Introduction

This report presents an adaptive migratory DSM algorithm that can dynamically self-invalidate the local copy of a page under a migratory memory access pattern. The adaptive migratory algorithm is implemented on top of our previous work [9], which adapts to memory access patterns to reduce coherency overhead by adjusting an "update limit". The adaptive DSM of [9] outperforms a competitive update protocol, as well as an invalidate protocol and an update protocol, on many types of applications. However, [9] cannot adapt to a migratory memory access pattern. Under migratory sharing, the message that sends a copy of a page to a remote node (on which a page fault has occurred) is shortly followed by an invalidation request from that remote node. Our adaptive migratory algorithm eliminates this invalidation overhead by self-invalidation [11] when a copy of the page is sent to the remote node. The proposed scheme allows each node to independently choose (at run-time) a different protocol (migratory, invalidate, or competitive update) for each page, using local information only. Experimental results show that performance is improved by dynamically selecting the migratory protocol.

Adaptive protocols for migratory sharing [4, 12, 14] and self-invalidation [11] have been proposed previously for hardware cache-coherence schemes. [4, 12, 14] dynamically identify migratory shared data and switch to a migratory protocol in order to reduce overhead; [11] predicts which blocks will be invalidated and performs self-invalidation. Dynamic page placement schemes [3, 10], including page migration, have been implemented in OS-level NUMA memory management; these schemes dynamically select either migration or replication for page placement according to the application and architecture.
Our adaptive migratory scheme is implemented for a software DSM and differs from the others as follows:

Design domain: [4] is designed for bus-based and directory-based cache-coherent multiprocessors; [12, 14, 11] are based on directory-based cache-coherent multiprocessors. In a bus-based multiprocessor, requests by other nodes (read/write miss, invalidate) can be observed on the bus, and in a directory-based cache-coherent multiprocessor, the home node maintains directory entries. In these architectures, global

state (the number of cached copies, the last invalidator of a block) can be known by some or all nodes. However, these schemes cannot be used directly in a DSM where no actual owner exists, such as Munin [2] or Quarks [8]: no node has global information about the copies of a page. Our scheme is incorporated into a software DSM in which memory coherency is maintained in a totally distributed manner. Our adaptive migratory scheme does not need global information; each node tries to detect migratory memory access patterns using local information only. [3, 10] are proposed for dynamic page placement in NUMA architectures. Their dynamic page placement policies cannot be applied to DSM due to an architectural difference: on a page fault in a NUMA architecture, a node can access remote memory without allocating the page in local memory, whereas in a DSM remote memory access is not allowed (a copy of the page must be allocated in local memory).

Protocol switch: Since our scheme is based on a cost-comparison approach, the migratory protocol is chosen only when it is deemed optimal. If another protocol is optimal under migratory sharing in special cases (e.g., two nodes updating shared memory alternately), that protocol (the competitive update protocol, in this case) is chosen. [4, 14, 12] select a migratory protocol whenever the memory access pattern is migratory sharing, and [10] needs its policy parameters tuned per application and architecture.

Basic protocol: [4, 14, 11] are based on an invalidate protocol, and [12] is based on a competitive update protocol. [3, 10] are based on page placement schemes for NUMA multiprocessors. Our scheme is based on an adaptive DSM in which each node can independently choose (at run-time) a different protocol for each copy of a page.

Hybrid protocol: In [4, 12, 14], all copies of a block enter or exit migratory mode together. In our scheme, each node decides independently whether to use the migratory protocol.
Therefore, some nodes can use the migratory protocol while other nodes use another protocol (invalidate or competitive update). This hybrid feature can be implemented by slightly modifying the proposed scheme. Table 1 summarizes each scheme.

Scheme     Design domain   Protocols (Schemes)        Features
[4]        Dir or Bus      Inv + Mig
[14]       Dir             Inv + Mig
[12]       Dir             Comp + Mig
[11]       Dir             Inv + Self-Inv
[3, 10]    MM-NUMA         Remote + Replicate + Mig   CC
[9]        SDSM            Inv + Comp                 CC
Proposed   SDSM            Inv + Comp + Mig           CC + TD

Bus = bus-based cache-coherent multiprocessor
Dir = directory-based cache-coherent multiprocessor
MM-NUMA = memory management system for NUMA multiprocessors
SDSM = software Distributed Shared Memory
Inv = invalidate protocol
Mig = migratory protocol (scheme)
Remote = remote memory access
Replicate = page replication
Comp = competitive update protocol
CC = cost comparison
TD = totally distributed

Table 1. Adaptive Protocols.

2 Basic Adaptive Protocol

The adaptive migratory scheme is based on [9], in which each node can independently choose an invalidate protocol or a competitive update protocol for each copy of a page. The basic adaptive protocol [9] is summarized in this section:

1. Collect statistics over a "sampling period". (Accesses to each memory page are divided into sampling periods.)

2. Using the statistics, determine the protocol that minimizes the "cost" for each page P.

3. Use the minimum-cost protocol to maintain the consistency of page P over the next sampling period.

4. Repeat the above steps.

Essentially, the proposed implementation uses statistics collected during the current execution to predict the optimal consistency protocol for the near future. This prediction should be accurate provided that the memory access patterns change relatively infrequently. In [9], we present an adaptive scheme that chooses between the invalidate protocol and the competitive update protocol [7, 5, 6]. The competitive update protocol is defined by a "threshold" parameter; we rename the threshold the "limit". The proposed adaptive scheme collects run-time data on the number and size of messages; this data is used periodically to determine a new value of the limit for each copy of a page. The protocol is completely distributed in that each node independently determines the limit to be used for each page in its local memory. (Thus, different nodes may choose different limits for the same page.) The objective of the protocol [9] is to minimize the "cost" metric of interest. Two cost metrics are considered: (i) the number of messages, and (ii) the amount of data transferred. Let us focus on the accesses to a particular page P as observed at a node A. These accesses can be partitioned into "segments". A new segment begins with the first access by node A following an update to the page by another node. (Segments are defined from the point of view of each node; therefore, for the same page, different nodes may observe different segments.) For a segment in which the number of updates by other nodes is U, we evaluate the cost of the invalidate protocol (the competitive update protocol with limit L = 0) and of the update protocol (the competitive update protocol with limit L = ∞).
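The segment construction just described can be made concrete with a short sketch; the trace encoding and function name below are illustrative, not part of the protocol in [9]:

```python
def segments(trace):
    """Partition node A's view of accesses to one page into segments.

    trace: a sequence of events, "A" for an access by node A and "U"
    for an update of the page by some other node.  A new segment begins
    with the first access by node A that follows a remote update, so
    each segment's U value is the number of remote updates it contains.
    Returns the list of U values, one per segment.
    """
    seg_updates = []      # U value of each completed segment
    updates = 0           # remote updates seen in the current segment
    after_update = False  # has a remote update occurred in this segment?
    started = False       # has node A accessed the page yet?
    for event in trace:
        if event == "U":
            updates += 1
            after_update = True
        else:  # an access by node A
            if not started:
                started = True
                updates = 0        # updates before A's first access
                after_update = False
            elif after_update:
                seg_updates.append(updates)  # close the current segment
                updates = 0
                after_update = False
    if started:
        seg_updates.append(updates)          # the final segment
    return seg_updates

# Three segments as seen by node A: U = 3, then U = 1, then U = 0.
print(segments("AAUUUAAUA"))  # -> [3, 1, 0]
```

From the per-segment U values, each node can estimate the average U that drives the protocol choice described next.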
The critical number of updates by other nodes, U_critical, at which limit L = 0 and limit L = ∞ incur the same cost in a segment, is computed. If U > U_critical, the invalidate protocol has a lower cost; if U < U_critical, the update protocol performs better. We choose the competitive update protocol with limit L = 3 instead of the update protocol when U < U_critical, because the competitive update protocol performs better

than the update protocol in general. With these modifications, the basic adaptive scheme that attempts to minimize the "cost" can be summarized as follows: each node collects data over a "sampling period" for each local page and estimates the average value of U. At the end of the sampling period, if U ≥ U_critical, the invalidate protocol (L = 0) is chosen for the next sampling period for that page; otherwise, the competitive update protocol is chosen.

3 Adaptive Migratory Scheme

3.1 Migratory Sharing

In migratory sharing, a page is accessed by a single node at any given time. A page is modified within a critical section to maintain mutual exclusion, so every access to the page is ordered by a sequence of acquire, shared memory access, and release. Migratory sharing is formally defined in [14] by the regular expression

    (R_i)(R_i)* (W_i)(R_i + W_i)*  (R_j)(R_j)* (W_j)(R_j + W_j)*  ...

where R_i and W_i denote a read and a write, respectively, by node i. To detect migratory sharing, the following conditions can be checked [4, 14, 12]:

1. i ≠ j, i.e., the writing node is not the same as the previous writer.

2. The number of cached copies is two (for [4, 14]), or nodes i and j are the only ones that have read the block since node i wrote it (for [12]).

3.2 Motivation

Previous schemes [4, 12, 14] for migratory sharing are designed for bus-based and/or directory-based cache-coherent multiprocessors. In these architectures, global state

(the number of cached copies, the last invalidator of a block) can be known by some or all nodes. However, these schemes cannot be used directly in a DSM where no actual owner exists, such as Munin [2] or Quarks [8]; no node has global information about the copies of a page. A new adaptive scheme for migratory sharing needs to be designed for software DSM (or similar architectures) in which memory coherency is maintained in a totally distributed manner; each node needs to detect migratory memory access patterns with local information only. Our scheme is implemented based on [9], in which each node can independently choose an invalidate protocol or a competitive update protocol for each copy of a page. However, [9] cannot adapt to a migratory memory access pattern. The proposed adaptive protocol includes the migratory protocol as one of the choices, in addition to the invalidate and competitive update protocols. Under a migratory memory access pattern, our previous adaptive scheme [9] will usually choose the invalidate protocol if the number of nodes (N) is not very small (U ≈ N - 1 ≥ U_critical). Consider the following scenario in a migratory memory access pattern:

1. Only node A has a copy of page P.

2. Node B requests page P on a read miss and receives it from node A.

3. Node B writes the page and sends an update message to node A.

4. Node A invalidates its copy of page P on receiving the update message from node B (node A has selected the invalidate protocol for page P).

Protocol overhead can be reduced if we can anticipate that node A will receive an update message from node B before node A accesses page P again.

Cost Analysis (Number of Messages)

We now consider the number of messages as the cost metric. Two messages can be saved (sending an update message and a responding message) by self-invalidation after sending a copy of page P to node B.
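For the scenario above, the two-message saving can be tallied explicitly; a minimal sketch, where F is the number of times the page request is forwarded before reaching the owner, and the message names are descriptive labels only:

```python
# Per-segment message tally for node A's copy of page P, assuming the
# page request is forwarded F times before reaching the current owner
# (F = 4 and the message names are illustrative, not protocol terms).
F = 4

invalidate_msgs = (
    ["forward page request"] * F   # locate the current owner of the page
    + ["send page to B",           # owner ships its copy to node B
       "acknowledge page",         # node B confirms receipt
       "update from B",            # node B's write is propagated to A
       "negative ack from A"]      # node A invalidates and refuses updates
)

migratory_msgs = (
    ["forward page request"] * F
    + ["send page to B",           # node A self-invalidates on sending,
       "acknowledge page"]         # so no update/negative-ack pair follows
)

assert len(invalidate_msgs) == F + 4                    # M_invalidate
assert len(migratory_msgs) == F + 2                     # M_migratory
assert len(invalidate_msgs) - len(migratory_msgs) == 2  # two messages saved
```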
If the memory access pattern is migratory sharing, the number of messages required by the migratory protocol in one segment (M_migratory) is computed as follows:

M_migratory = F + 2,

where F is the average number of times the request for the page is forwarded (due to the dynamic distributed ownership algorithm) before reaching the owner of the page. At the beginning of each segment, a page fault occurs, which requires F messages to forward the page request and two additional messages to receive the page and send an acknowledgement. The number of messages required by the invalidate protocol (M_invalidate) is F + 4 [9]. (The invalidate protocol requires two more messages than the migratory protocol, for receiving an invalidation message and sending a negative acknowledgement.) Figure 1 shows an analytical comparison of the number of messages required for one segment under migratory memory access patterns (we assume F = 4, and L = 3 for the competitive update protocol). Note that only the cost of memory accesses (reads and writes) is considered; the cost of synchronization (acquire) is not. The migratory protocol requires two fewer messages than the invalidate protocol, as described above. This figure suggests that the migratory protocol is the best choice when U ≥ 4 in a migratory memory access pattern, and hence that the migratory protocol should be included as one of the choices of our adaptive protocol. However, even in a migratory memory access pattern, the update and competitive update protocols are the best choice if U ≤ 2. This special case occurs when 2 or 3 nodes access shared memory in a migratory pattern. When U = 3, three protocols (migratory, update, and competitive update) require the same number of messages.

Cost Analysis (Amount of Data Transferred)

In the above analysis, we considered the number of messages as the cost. Now, we consider the amount of data transferred as the cost metric.
If the memory access pattern is migratory, the amount of data transferred by the migratory protocol in one segment (D_migratory) is computed as follows:

D_migratory = (F + 1) * p_control + p_page,

where p_control is the size of a control message (page request, acknowledgement of an update, etc.), and p_page is the size of the message required to send a page from one node to

another. The amount of data transferred by the invalidate protocol (D_invalidate) is p_update + (F + 2) * p_control + p_page, where p_update is the average size of an update message [9]. (At the beginning of each segment, a page fault occurs, requiring a data transfer of size F * p_control to forward the page request, p_page to receive the page, and p_control to send the acknowledgement. After that, a data transfer of size p_update + p_control is required to receive the update message and send the negative acknowledgement.)

[Figure 1: Number of Messages per Segment (in Migratory Memory Access Pattern). Curves for the invalidate, update, competitive, and migratory protocols versus the number of updates (U) in a segment.]

3.3 Implementation

From the above motivation, we add two features to our adaptive migratory scheme:

1. Detect migratory memory access pattern: Node A collects statistics over a sampling period to detect whether its local copy of page P is expected to be invalidated before node A accesses page P. Node A starts counting the number of updates by other nodes after sending page P to some other node, until node A accesses page P again. If the number of updates (between node A's sending page P and node A's first access to page P) for

each segment in one sampling period is greater than or equal to U_critical at node A, then it may cost less to self-invalidate the local copy of page P than to maintain it, because the local copy is expected to be invalidated before node A accesses page P (migratory sharing).

2. Self-invalidation: Node A self-invalidates its local copy of page P when sending the page to another node, if node A detected the migratory memory access pattern for page P in step 1.

Our adaptive scheme can enter and exit migratory mode by executing the following algorithm at each node for each page:

    page_reply:                       // a page request is received from another node
        send page;
        pagesent = 1;
        if (mig < 0 && segment == 0)  // new sampling period
            mig = 0;
        reset page access permission; // to detect a local access before an update
                                      // message is received from another node
        if (migratory protocol)       // migratory protocol has been used
            self-invalidate page;

    page_fault:
        if (page does not exist)
            request_page;
        if (updatecnt > 0) {
            segment++;
            updateinsampling += updatecnt;
            if (updatecnt >= U_critical && mig >= 0 && pagesent)
                mig++;                // still seems to be a migratory pattern
            else
                mig = -1;             // does not seem to be a migratory pattern
            if (segment == Ns) {      // end of sampling period
                if (updateinsampling / segment >= U_critical) {
                    if (mig == Ns)    // deemed migratory in all segments
                        choose migratory protocol;
                    else
                        choose invalidate protocol;
                } else
                    choose competitive update protocol;
                segment = 0;
                updateinsampling = 0;
                mig = -1;
            }
        } else                        // a local access occurred between the page
            mig = -1;                 // sending and receiving an update message
        pagesent = 0;

Refer to [9] for the detailed implementation of the counters updatecnt, segment, etc. On a page reply, the counter mig is reset at the start of a new sampling period (segment = 0). The counter mig checks whether the number of updates by other nodes between a page sending (page reply) and the first subsequent local access is greater than or equal to U_critical for all segments of a sampling period at node A. If the local copy of the page is already in migratory mode, it is self-invalidated. The flag pagesent is used to detect the first local access after a page sending. On a page fault, (1) a page request message is sent if no copy of the page exists, and (2) the number of updates is checked. If the number of updates is not less than the threshold (U_critical) and pagesent is set (the first local access after a page sending), mig is incremented (a migratory access pattern is expected and the migratory protocol appears optimal); otherwise, mig is set to -1. When

one sampling period ends (segment = N_s), the appropriate protocol is chosen: (1) if the average number of updates is less than the threshold value (U_critical), the competitive update protocol is chosen; (2) if the average number of updates is not less than the threshold value (U_critical), the invalidate or migratory protocol is chosen according to the value of mig (if mig = N_s, the migratory protocol is chosen; else, the invalidate protocol is chosen).

The above condition for selecting the migratory protocol is very strict:

1. A page sending is required in every segment of the sampling period, and the number of updates between the page sending and the first local access must always be at least the critical value (U_critical); i.e., mig = N_s.

2. The average number of updates per segment must be at least the critical value (U_critical).

If node A does not send page P in any one of the segments of a sampling period, the migratory protocol is not chosen (condition 1). Condition 2 may not be necessary, because condition 1 is tighter than condition 2 (if condition 1 is satisfied, condition 2 is also satisfied). The adaptive migratory scheme can be summarized as follows:

1. Choose the competitive update protocol if U < U_critical; otherwise, choose the invalidate or the migratory protocol. If the competitive update protocol is chosen, protocol selection is finished; otherwise, go to step 2.

2. Choose the migratory protocol if the first condition above (mig = N_s) is satisfied; otherwise, choose the invalidate protocol.

Figure 2 shows examples of how the appropriate protocol is chosen according to the memory access patterns for page P at node A. We assume the sampling period (N_s) is 2 and the threshold number of updates (U_critical) is 4. In the first scenario (Figure 2 (a)), the competitive update protocol is chosen because the average number of updates over the sampling period (U_avg = 1.5) is less than the threshold value (U_critical = 4).
In the third scenario (Figure 2 (c)), the migratory protocol is chosen because (1) the average number of updates over the sampling period (U_avg = 4) is not less than the critical value (U_critical = 4), and (2) the number of updates between the page sending and the first local

access is not less than the critical value (U_critical = 4) in all segments of the sampling period. In the second scenario (Figure 2 (b)), however, the invalidate protocol is chosen because the number of updates between the page sending and the first local access is 0 (less than the critical value, U_critical = 4) in the first segment, and node A does not send page P in the second segment (either condition alone is sufficient to choose the invalidate protocol).

4 Performance Evaluation

Experiments were performed to evaluate the performance of the adaptive DSM by running applications on an implementation of the adaptive protocol. We implemented the adaptive protocol by modifying an existing DSM, Quarks (Beta release 0.8) [1, 8]. This section presents the experimental results. We evaluated the adaptive scheme using synthetic applications (qtest, ProdCons, and Reader/Writer) as well as other applications (Floyd-Warshall, SOR, QSORT, IS, Matmult, and Gauss-Jacobi). qtest is a simple shared-memory application based on a program available with the Quarks release [8]: all nodes access the shared data concurrently, and a process acquires mutual exclusion before each access and releases it afterwards. We measured the cost (i.e., number of messages and size of data transferred) by executing different instances of the synthetic application, as described below. Floyd-Warshall, QSORT, IS, and Gauss-Jacobi were developed at Texas A&M University; SOR and Matmult are available with the Quarks release [8]; ProdCons and Reader/Writer are based on qtest. The sampling period (N_s) is chosen to be 2 for all applications.

Results for qtest Application

The body of the first instance of the qtest program (named qtest1) is as follows:

    qtest1:
        repeat NLOOP times {
            acquire(lock_id);
            for (n = 1 to NSIZE)
                shmem[n]++;       // shared memory access
            release(lock_id);

[Figure 2: Protocol Selection (N_s = 2, U_critical = 4). Timelines of remote accesses by other nodes and local accesses to page P over two segments: (a) the competitive update protocol is selected, (b) the invalidate protocol is selected, (c) the migratory protocol is selected.]

        }

Each node performs the above task. All the shared data accessed in this application is confined to a single page. Each node executes the repeat loop 300 times (NLOOP = 300); 300 iterations were sufficient for the results to converge. The size of the shared data (NSIZE) is 2048 bytes, all in one page, the page size being 4096 bytes. (The next experiment considers a small NSIZE.) The adaptive protocol initializes L to 3 for each page at each node. At the end of each sampling period (N_s = 2), each node estimates U and p_update (the average size of an update message) for the page and selects the appropriate L; this L is used during the next sampling period. For this application, Figures 3 and 4 show the measured cost as the number of nodes (N) increases. The costs are plotted on a per-"transaction" basis. A transaction denotes a sequence of operations (acquire, shared memory access, and release) in one loop of the qtest1 main routine. The curve for the adaptive scheme in Figure 3 is plotted using the heuristic that minimizes the number of messages; the curve in Figure 4 is plotted using the heuristic that minimizes the amount of data transferred. In Figure 3, a curve named after a protocol denotes the number of messages required by that protocol, and "#updates" denotes the average number of updates per segment (U) calculated over the entire application. "adaptive" denotes the scheme in [9]; "adaptive+" denotes the adaptive migratory protocol. As the number of nodes N increases, the average number of updates per segment (U) increases proportionally. The adaptive migratory protocol performs best because qtest1 exhibits the migratory memory access pattern. The adaptive migratory protocol requires approximately two fewer messages per transaction than the adaptive protocol, which chooses the invalidate protocol.
However, the adaptive migratory protocol requires the same number of messages as the adaptive protocol when N ≤ 4 (U < U_critical), because both protocols choose the competitive update protocol (L = 3). The cost curve for the adaptive migratory protocol (and for the invalidate protocol) is not flat, while it is flat in the cost analysis shown in Figure 1. The reason is that the cost of synchronization (acquire) increases with the number of nodes (N); however, the synchronization cost, which is not shown in Figure 1, is approximately the same for all protocols at a given number of nodes (N).
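The flat analytical curves can be reproduced from the per-segment counts of Section 3.2 (M_migratory = F + 2, M_invalidate = F + 4). The update-protocol count of 2U messages (one update plus one acknowledgement per remote update) is our assumption, chosen to be consistent with the reported tie of the migratory and update protocols at U = 3:

```python
# Per-segment message counts from the cost analysis, with F = 4 as in
# Figure 1.  m_update is an assumed model (2 messages per remote update),
# not a formula stated in the cost analysis itself.
F = 4

def m_migratory(U):  return F + 2   # fault, page, ack; then self-invalidate
def m_invalidate(U): return F + 4   # ... plus an update and a negative ack
def m_update(U):     return 2 * U   # assumed: update + ack per remote update

# Migratory is cheapest for U >= 4, update for U <= 2, and the
# migratory and update protocols tie at U = 3:
assert m_update(3) == m_migratory(3) == 6
assert all(m_migratory(U) < min(m_invalidate(U), m_update(U))
           for U in range(4, 16))
assert all(m_update(U) < m_migratory(U) for U in (1, 2))
```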

Figure 4 compares the amount of data transferred per transaction. Since the qtest1 application modifies a large amount of data (NSIZE = 2048 bytes), the update protocol requires a larger amount of data transfer as the number of nodes (N) increases. However, the invalidate protocol requires a nearly constant amount of data transfer (per transaction) for all N. The adaptive migratory protocol chooses the appropriate protocol, including the migratory protocol, for all values of N, thereby minimizing the amount of data transferred. The second experiment was performed with the main loop (qtest2) shown below:

    qtest2:
        repeat NLOOP times {
            acquire(lock_id);
            if (random() < read_ratio)    // 0 <= random() <= 1
                for (n = 1 to NSIZE)
                    r_value = shmem[n];   // read
            else
                for (n = 1 to NSIZE)
                    shmem[n] = w_value;   // write
            release(lock_id);
        }

All the shared data accessed in qtest2 is confined to a single page. For this experiment, we use a small amount of shared-data access per iteration of the repeat loop (NSIZE = 4). Additionally, each iteration of the repeat loop either reads or writes the shared data depending on whether a random number (random()) is smaller than the read ratio or not; this allows us to control the frequency of write accesses to the shared data. 8 nodes access the shared data 100 times each (NLOOP = 100). (We observed that the results converge quite quickly.) Figure 5 presents the number of messages per transaction (i.e., acquire, shared memory access, and release). The adaptive migratory protocol requires fewer messages than the adaptive protocol when the read ratio is less than 20% because qtest2 tends to exhibit a migratory memory access pattern at low read ratios. Figure 6 compares the amount of data transferred per transaction. Since the qtest2 application modifies a small amount of data (NSIZE = 4 bytes), both the adaptive protocol

[Figure 3: qtest1: Average Number of Updates (U) and Messages per Transaction versus the Number of Nodes (N), for the invalidate, update, competitive, adaptive, and adaptive+ protocols.]

[Figure 4: qtest1: Amount of Data (Bytes) Transferred per Transaction versus the Number of Nodes (N).]

and the adaptive migratory protocol choose a competitive update protocol with a large update limit (L). Therefore, the two adaptive protocols require a small amount of data transfer.

Results for Other Applications

We now evaluate our adaptive scheme by executing eight additional applications (Floyd-Warshall, SOR, ProdCons, QSORT, IS, Reader/Writer, Matmult, and Gauss-Jacobi) on 8 nodes. Floyd-Warshall is an all-pairs-shortest-paths algorithm. (We use 128 vertices as input.) SOR is the Successive Over-Relaxation method, which executes a simple iterative relaxation algorithm. (We use a grid as input.) ProdCons is an implementation of a simple producer/consumer model: producers make data that is used by consumers. (We execute a total of 1,600 "transactions" for ProdCons; a transaction denotes a sequence of acquire, shared memory access, and release, similar to the definition for qtest.) QSORT is the quicksort algorithm. (We sort 65,536 elements.) IS is an integer sorting algorithm. (We use 3,200 keys with a range of 100.) Reader/Writer is implemented by modifying qtest to evaluate performance under time-varying memory access patterns: execution time is divided into 4 stages, and the memory access pattern differs in each stage; a node can be either a reader or a writer for each page depending on the execution stage, and the size of the data written differs per stage. (A total of 1,920 transactions are executed.) Matmult is a matrix multiplication program that computes A^n. (We compute A^10, where A is a matrix.) Gauss-Jacobi is a linear system solver using an iteration method. (We solve a linear system of size 128.) Figures 7 and 8 show performance comparisons under each cost metric (the number of messages or the amount of data transferred). Costs are normalized with respect to the protocol with the maximum cost for each application. Floyd-Warshall, SOR, Matmult, and Gauss-Jacobi use barriers for synchronization.
In these types of applications, the adaptive migratory protocol (+) performs similarly to the adaptive protocol and shows no improvement over it, because Floyd-Warshall, SOR, Matmult, and Gauss-Jacobi do not exhibit the migratory memory access pattern. ProdCons and QSORT use lock/unlock for a task queue, IS uses lock/unlock for ranking, and Reader/Writer uses lock/unlock for exclusive object access. These applications exhibit

[Figure 5: qtest2: Average Number of Updates (U) and Messages per Transaction versus the Read Ratio, for the invalidate, update, competitive, adaptive, and adaptive+ protocols.]

[Figure 6: qtest2: Amount of Data (Bytes) Transferred per Transaction versus the Read Ratio.]

migratory memory access patterns. (In Reader/Writer, some pages show migratory memory access patterns.) In three applications (ProdCons, IS, and Reader/Writer), the adaptive migratory protocol (+) requires the fewest messages. However, in QSORT, the adaptive migratory protocol shows no performance gain over the invalidate protocol, because the list of elements to be sorted may not exhibit a migratory memory access pattern at page granularity. We evaluated the performance of our adaptive protocol on a synthetic Reader/Writer application in which the memory access patterns (read-to-write ratio, access period, amount of data written per transaction, etc.) are time-varying. The results show that the adaptive protocol performs well by adapting to time-varying memory access patterns; the adaptive migratory protocol performs better than the adaptive protocol because some pages exhibit a migratory memory access pattern. Overall, the experimental results show that our adaptive migratory scheme performs well under migratory sharing, and suggest that it can predict migratory sharing when memory access patterns do not change frequently.

5 Conclusion and Future Work

Our objective is to design an adaptive migratory scheme for DSM that can adapt to time-varying patterns of accesses to the shared memory (including migratory sharing). Our approach continually gathers statistics at run-time and periodically determines the appropriate protocol for each copy of each page. The choice of protocol is based on the "cost" metric to be minimized; the cost metrics considered in this paper are the number and size of the messages required to execute an application using the DSM implementation. Our adaptive approach determines, at run-time, the cost of each candidate consistency protocol and uses the protocol that appears to have the smaller cost.
The proposed adaptive approach is illustrated here by means of an adaptive migratory scheme that chooses either the migratory, invalidate, or competitive update protocol for each copy of a page; the choice changes with time as the access patterns change. The paper presents an experimental evaluation of the adaptive migratory scheme using an implementation based on Quarks DSM [8]. Experimental results from the implementation suggest that the proposed adaptive approach can indeed reduce the cost.

Figure 7: Cost Comparisons (Number of Messages)
Figure 8: Cost Comparisons (Amount of Data Transferred)

Further work is needed to fully examine the effectiveness of the proposed approach. The cost metrics considered in this paper are the number and size of messages; other cost metrics need to be considered. In particular, the impact of our heuristics on application execution time needs to be evaluated. The adaptive approach (based on cost comparison) presented here could also be combined with ideas developed by other researchers (e.g., [13]) to obtain further improvements in DSM performance; as yet, we have not explored this possibility.

Acknowledgements

We thank John Carter and D. Khandekar at the University of Utah for making the Quarks source code available in the public domain, and Akhilesh Kumar for the Floyd-Warshall source code.

References

[1] J. Carter, D. Khandekar, and L. Kamb, "Distributed shared memory: Where we are and where we should be headed," in Proc. of the Fifth Workshop on Hot Topics in Operating Systems, pp. 119-122, May.

[2] J. B. Carter, Efficient Distributed Shared Memory Based On Multi-Protocol Release Consistency. PhD thesis, Rice University, Sept.

[3] A. Cox and R. Fowler, "The implementation of a coherent memory abstraction on a NUMA multiprocessor: Experience with PLATINUM," in Proc. of the 12th ACM Symposium on Operating Systems Principles, pp. 32-44.

[4] A. Cox and R. Fowler, "Adaptive cache coherency for detecting migratory shared data," in Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 98-108, May.

[5] F. Dahlgren and P. Stenstrom, "Using write caches to improve performance of cache coherence protocols in shared-memory multiprocessors," Journal of Parallel and Distributed Computing, vol. 26, pp. 193-210, Apr.

[6] H. Grahn, P. Stenstrom, and M. Dubois, "Implementation and evaluation of update-based cache protocols under relaxed memory consistency models," Future Generation Computer Systems, vol. 11, pp. 247-271, June.

[7] A. Karlin, M. Manasse, L. Rudolph, and D. Sleator, "Competitive snoopy caching," in Proc. of the 27th Annual Symposium on Foundations of Computer Science, pp. 244-254.

[8] D. Khandekar, "Quarks: Portable DSM on Unix," tech. rep., University of Utah.

[9] J.-H. Kim and N. H. Vaidya, "A cost-comparison approach for adaptive distributed shared memory," in ACM International Conference on Supercomputing (ICS), pp. 44-51, May.

[10] R. LaRowe, C. Ellis, and L. Kaplan, "The robustness of NUMA memory management," in Proc. of the 13th ACM Symposium on Operating Systems Principles, pp. 137-151.

[11] A. Lebeck and D. Wood, "Dynamic self-invalidation: Reducing coherence overhead in shared-memory multiprocessors," in Proceedings of the 22nd Annual International Symposium on Computer Architecture. To appear.

[12] H. Nilsson and P. Stenstrom, "An adaptive update-based cache coherence protocol for reduction of miss rate and traffic," tech. rep., Lund University. To appear in Parallel Architectures and Languages Europe, July.

[13] U. Ramachandran, G. Shah, A. Sivasubramaniam, A. Singla, and I. Yanasak, "Architectural mechanisms for explicit communication in shared memory multiprocessors," in Supercomputing '95, Dec.

[14] P. Stenstrom, M. Brorsson, and L. Sandberg, "An adaptive cache coherence protocol optimized for migratory sharing," in Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 109-118, May.


More information

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a Asynchronous Checkpointing for PVM Requires Message-Logging Kevin Skadron 18 April 1994 Abstract Distributed computing using networked workstations oers cost-ecient parallel computing, but the higher rate

More information

London SW7 2BZ. in the number of processors due to unfortunate allocation of the. home and ownership of cache lines. We present a modied coherency

London SW7 2BZ. in the number of processors due to unfortunate allocation of the. home and ownership of cache lines. We present a modied coherency Using Proxies to Reduce Controller Contention in Large Shared-Memory Multiprocessors Andrew J. Bennett, Paul H. J. Kelly, Jacob G. Refstrup, Sarah A. M. Talbot Department of Computing Imperial College

More information

CSE Traditional Operating Systems deal with typical system software designed to be:

CSE Traditional Operating Systems deal with typical system software designed to be: CSE 6431 Traditional Operating Systems deal with typical system software designed to be: general purpose running on single processor machines Advanced Operating Systems are designed for either a special

More information

Jukka Julku Multicore programming: Low-level libraries. Outline. Processes and threads TBB MPI UPC. Examples

Jukka Julku Multicore programming: Low-level libraries. Outline. Processes and threads TBB MPI UPC. Examples Multicore Jukka Julku 19.2.2009 1 2 3 4 5 6 Disclaimer There are several low-level, languages and directive based approaches But no silver bullets This presentation only covers some examples of them is

More information

between Single Writer and Multiple Writer 1 Introduction This paper focuses on protocols for implementing

between Single Writer and Multiple Writer 1 Introduction This paper focuses on protocols for implementing Software DSM Protocols that Adapt between Single Writer and Multiple Writer Cristiana Amza y, Alan L. Cox y,sandhya Dwarkadas z, and Willy Zwaenepoel y y Department of Computer Science Rice University

More information

S = 32 2 d kb (1) L = 32 2 D B (2) A = 2 2 m mod 4 (3) W = 16 2 y mod 4 b (4)

S = 32 2 d kb (1) L = 32 2 D B (2) A = 2 2 m mod 4 (3) W = 16 2 y mod 4 b (4) 1 Cache Design You have already written your civic registration number (personnummer) on the cover page in the format YyMmDd-XXXX. Use the following formulas to calculate the parameters of your caches:

More information

Chapter 5. Thread-Level Parallelism

Chapter 5. Thread-Level Parallelism Chapter 5 Thread-Level Parallelism Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Progress Towards Multiprocessors + Rate of speed growth in uniprocessors saturated

More information

Data-Centric Consistency Models. The general organization of a logical data store, physically distributed and replicated across multiple processes.

Data-Centric Consistency Models. The general organization of a logical data store, physically distributed and replicated across multiple processes. Data-Centric Consistency Models The general organization of a logical data store, physically distributed and replicated across multiple processes. Consistency models The scenario we will be studying: Some

More information

Joe Wingbermuehle, (A paper written under the guidance of Prof. Raj Jain)

Joe Wingbermuehle, (A paper written under the guidance of Prof. Raj Jain) 1 of 11 5/4/2011 4:49 PM Joe Wingbermuehle, wingbej@wustl.edu (A paper written under the guidance of Prof. Raj Jain) Download The Auto-Pipe system allows one to evaluate various resource mappings and topologies

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

Optimization of thread affinity and memory affinity for remote core locking synchronization in multithreaded programs for multicore computer systems

Optimization of thread affinity and memory affinity for remote core locking synchronization in multithreaded programs for multicore computer systems Optimization of thread affinity and memory affinity for remote core locking synchronization in multithreaded programs for multicore computer systems Alexey Paznikov Saint Petersburg Electrotechnical University

More information

Page 1. Cache Coherence

Page 1. Cache Coherence Page 1 Cache Coherence 1 Page 2 Memory Consistency in SMPs CPU-1 CPU-2 A 100 cache-1 A 100 cache-2 CPU-Memory bus A 100 memory Suppose CPU-1 updates A to 200. write-back: memory and cache-2 have stale

More information

Cache Injection on Bus Based Multiprocessors

Cache Injection on Bus Based Multiprocessors Cache Injection on Bus Based Multiprocessors Aleksandar Milenkovic, Veljko Milutinovic School of Electrical Engineering, University of Belgrade E-mail: {emilenka,vm@etf.bg.ac.yu, Http: {galeb.etf.bg.ac.yu/~vm,

More information

CSE 120 Principles of Operating Systems

CSE 120 Principles of Operating Systems CSE 120 Principles of Operating Systems Spring 2018 Lecture 15: Multicore Geoffrey M. Voelker Multicore Operating Systems We have generally discussed operating systems concepts independent of the number

More information

Foundations of Computer Systems

Foundations of Computer Systems 18-600 Foundations of Computer Systems Lecture 21: Multicore Cache Coherence John P. Shen & Zhiyi Yu November 14, 2016 Prevalence of multicore processors: 2006: 75% for desktops, 85% for servers 2007:

More information

Distributed Shared Memory (DSM) Introduction Shared Memory Systems Distributed Shared Memory Systems Advantage of DSM Systems

Distributed Shared Memory (DSM) Introduction Shared Memory Systems Distributed Shared Memory Systems Advantage of DSM Systems Distributed Shared Memory (DSM) Introduction Shared Memory Systems Distributed Shared Memory Systems Advantage of DSM Systems Distributed Shared Memory Systems Logical location of M 1 Distributed Shared

More information

Scalable Cache Coherence

Scalable Cache Coherence Scalable Cache Coherence [ 8.1] All of the cache-coherent systems we have talked about until now have had a bus. Not only does the bus guarantee serialization of transactions; it also serves as convenient

More information

CONSISTENCY MODELS IN DISTRIBUTED SHARED MEMORY SYSTEMS

CONSISTENCY MODELS IN DISTRIBUTED SHARED MEMORY SYSTEMS Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 9, September 2014,

More information

source3 Backbone s1 s2 R2 R3

source3 Backbone s1 s2 R2 R3 Fast and Optimal Multicast-Server Selection Based on Receivers' Preference Akihito Hiromori 1, Hirozumi Yamaguchi 1,Keiichi Yasumoto 2, Teruo Higashino 1, and Kenichi Taniguchi 1 1 Graduate School of Engineering

More information

The Use of Instruction-Based Prediction in Hardware Shared- Memory

The Use of Instruction-Based Prediction in Hardware Shared- Memory The Use of Instruction-Based Prediction in Hardware Shared- Memory Stefanos Kaxiras University of Wisconsin-Madison kaxiras@cs.wisc.edu Abstract In this paper we propose Instruction-based Prediction as

More information

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

A Case for Two-Level Distributed Recovery Schemes. Nitin H. Vaidya. reduce the average performance overhead.

A Case for Two-Level Distributed Recovery Schemes. Nitin H. Vaidya.   reduce the average performance overhead. A Case for Two-Level Distributed Recovery Schemes Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-31, U.S.A. E-mail: vaidya@cs.tamu.edu Abstract Most distributed

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Cache Coherence - Directory Cache Coherence Prof. Michel A. Kinsy Shared Memory Multiprocessor Processor Cores Local Memories Memory Bus P 1 Snoopy Cache Physical Memory P

More information

CSCI 4717 Computer Architecture

CSCI 4717 Computer Architecture CSCI 4717/5717 Computer Architecture Topic: Symmetric Multiprocessors & Clusters Reading: Stallings, Sections 18.1 through 18.4 Classifications of Parallel Processing M. Flynn classified types of parallel

More information

This paper describes and evaluates the Dual Reinforcement Q-Routing algorithm (DRQ-Routing)

This paper describes and evaluates the Dual Reinforcement Q-Routing algorithm (DRQ-Routing) DUAL REINFORCEMENT Q-ROUTING: AN ON-LINE ADAPTIVE ROUTING ALGORITHM 1 Shailesh Kumar Risto Miikkulainen The University oftexas at Austin The University oftexas at Austin Dept. of Elec. and Comp. Engg.

More information