
Minimizing the Directory Size for Large-scale DSM Multiprocessors
Technical Report TR

Department of Computer Science and Engineering
University of Minnesota
EECS Building, 200 Union Street SE
Minneapolis, MN, USA

Jinseok Kong, Pen-Chung Yew, and Gyungho Lee
January 09, 1999


Minimizing the Directory Size for Large-Scale DSM Multiprocessors

Jinseok Kong, Pen-Chung Yew
Department of Computer Science
University of Minnesota
Minneapolis, MN, USA

Gyungho Lee
Division of Engineering
University of Texas
San Antonio, TX, USA

Abstract

Directory-based cache coherence schemes are commonly used in large-scale distributed shared-memory (DSM) multiprocessors, but most of them rely on heuristics to avoid large hardware requirements. We propose to use physical address mapping on the memory modules to significantly reduce the directory size needed. This approach allows the size of the directory to grow as O(cn log2 n), as in optimal pointer-based directory schemes [10], where n is the number of nodes in the system and c is the number of cache lines in each cache memory. Our simulations show that the proposed scheme can achieve performance equivalent to the heuristic pointer-based directory schemes, but with a smaller directory size and a simpler pointer management scheme.

Key words: cache coherence, directory protocol, distributed shared memory multiprocessor, memory architecture.

1 Introduction

A major challenge in building a large-scale distributed shared memory (DSM) multiprocessor is to provide the system with a cost-effective cache coherence protocol. Snooping protocols [4, 9, 19] are not feasible on such systems because of their need to perform extensive broadcasts in the network. Directory-based protocols, which support point-to-point communication between nodes, have long been proposed for such systems [5], and have also been implemented in some recent DSM multiprocessors [1, 14, 15, 16, 17]. Although directory-based protocols are more complex, their distributed nature makes them a potential choice even for systems with a large number of processors. However, the poor scalability of the full-map directory size [5] poses a serious impediment to using such protocols in large systems.

Many pointer-based directories [2, 6, 10, 18] have been proposed to reduce the directory size and to make the hardware cost more manageable. Most of the proposed pointer-based schemes rely on heuristics to scale with the increase of the system size. Although the heuristic schemes can reduce the memory overhead to well below the memory size, they still require a large amount of memory compared to the cache memory size [18]. Furthermore, the heuristic schemes can increase the complexity of the directory protocols, which raises the issue of stability for the system.

Another consideration is whether the directory schemes are scalable with the increase of the memory size. Due to the rapid progress in VLSI technology, DRAMs have quadrupled in capacity every three years since the 1980s, which is more than the increase in the system size (e.g., the Cray X-MP had 4 processors around 1985, while the Cray T3E could have 2048 processors around 1997). Hence, even without scaling up the system size significantly, the directory size can still increase significantly if the memory size is substantially increased, as in most recent systems.

In this paper, we propose a physical address mapping scheme on the memory modules to control the number of cache lines which can be cached in each cache memory. More specifically, we use the data interleaving scheme on the memory modules to group all of the cache lines which are mapped to the same cache set, and then assign the group of cache lines to the same memory module. In a direct-mapped cache, only one of the cache lines in this group can reside in a particular cache memory at any instance of time. Using this property, we can reduce the size of the pointer-based directories and allow it to scale well with both the system size and the memory size.

[Figure 1: Directory-based distributed shared memory (DSM) multiprocessor: each node contains a processor with its cache, a memory module with its directory, and a connection to the interconnection network.]

To simplify our presentation, we assume a generic cache-coherent DSM multiprocessor (see Figure 1). Each system node consists of a processor with its own cache memory and a memory module which is a part of the main memory. These system nodes are connected by a point-to-point interconnection network.

The paper is organized as follows. Section 2 provides the necessary background to motivate our study. Section 3 describes our proposed directory organization and its associated protocol; the implementation issues for the directory scheme are also discussed in that section. The simulation results are presented in Section 4, and Section 5 summarizes the results with some conclusions.

2 Background

The storage requirement of a basic full-map directory [5] in each node is proportional to O(mn) for a system with n nodes, where m is the number of cache lines in each memory module. This growth makes the full-map directory very expensive in a large DSM multiprocessor. Since the total cache memory size is much smaller than the total main memory size, and data sharing is often limited to a small number of processors [2, 7, 21], most of the bits in the full-map directory are wasted. One key observation here is that the number of directory bits set at any time is bounded by the total number of cache lines in the cache memories, regardless of how the data is shared.

To take advantage of this observation and to improve the scalability of the directory size, some pointer-based directory schemes have been proposed. For example, the size of the chained directory in each node as proposed in [10] grows as O(c log2 n), where c is the number of cache lines in each cache memory. This is much smaller than the O(mn) of the full-map directory. However, the chained directory significantly increases the protocol complexity and the data access latency because the directory information is distributed over the cache memories as linked lists.

Several other pointer-based directories have also been proposed [2, 6, 18]. The limited-pointer directory scheme [2] tries to reduce the directory size by limiting the number of pointers allowed for each cache line in the main memory. An expensive broadcast scheme is used if excessive data sharing causes the number of pointers to exceed that limit. The LimitLESS scheme [6] reduces the broadcasting overhead by overflowing the pointers into full-map bit vectors. The dynamic pointer allocation scheme [18] maintains a pointer pool of a fixed size for each memory module. A linked list of pointers is maintained for each cache line which has a copy in the cache memories, and each of the pointers in the linked list points to a node which contains a copy of the cache line.

[Figure 2: Some linked lists in a directory using a shared pointer pool, showing the directory entries for cache lines, the head of free list (HFL), and the shared pointer pool.]

Figure 2 shows an example of such linked lists in a node: one linked list records the nodes that have a copy of one cache line, and another records the node that has a copy of a second cache line. When the needed pointers exceed the total number of pointers in the shared pointer pool, some pointers are selected and "victimized" by invalidating their corresponding cache copies to free the pointers for new assignments. The performance can suffer significantly if the pointer pool is too small and pointer overflow occurs too often. Heuristics show that pointer overflow can be rare with a pointer pool size less than 7% of the memory size. However, the number of pointers still needs to be much (8 to 16 times) larger than the number of cache lines in the cache memory [18]. Furthermore, if the possibility of pointer overflow is not zero, the directory protocol has to cope with pointer overflow, which can be critical to the stability of the system. Our proposed scheme tries to avoid most of these problems.

3 The Proposed Scheme

[Figure 3: (a) A general data interleaving scheme on memory modules, showing the address fields: page address (I_p), line address (I_l), byte offset (I_b), index for nodes (I_n), and index for sets (I_s); (b) an example listing all of the cache lines in one memory module which are mapped to the same cache set in the cache memory; (c) every memory module has 4 cache lines mapped to that cache set.]

To simplify our presentation, let us first look at some general data interleaving schemes on the main memory modules and study their effect on the data mapping in the cache memories. Assume the unit of our data interleaving is one page, i.e., we allocate one page to each memory module in a round-robin fashion across all memory modules. Figure 3 (a) shows the assignment of the address fields in a physical byte address using such an interleaving scheme. The address field I_l specifies the cache line number, I_b specifies the byte offset in a cache line, I_p specifies the page number, and I_n specifies the node number, i.e., the home node of a cache line.

To simplify our presentation, let us further assume that the cache is direct-mapped and that its size is the same as the page size. For a direct-mapped cache, the address field I_s is used as the index to the cache line within a cache memory. Using such an interleaving scheme, an entire page (or any portion of a page) can reside in any cache memory.

If we study this data interleaving scheme more carefully, we can observe that 2^(I_l - I_s - I_n) (or, equivalently, 2^(I_p - I_n)) of the cache lines in each memory module are mapped to the same cache line in each cache memory. To see this, in Figure 3 (a), the node number of a particular memory module and the set number of a particular cache line in its cache memory are highlighted as the shaded address fields. For a selected node and a particular cache line in the cache memory, the address bits in those shaded fields are fixed. Only the remaining address bits in the non-shaded address field, whose length is (I_l - I_s - I_n) bits, can vary. Each of those addresses corresponds to a cache line in the selected memory module which is mapped to the same cache line in the cache memory.

Figure 3 (b) shows an example with a 10-bit byte address and a 2-byte cache line. In the figure, the selected memory module (I_n) is 000_2, and a particular cache set (I_s) is selected. The table lists all of the cache lines in the memory module that are mapped to that cache set in the cache memory. Since (I_l - I_s - I_n) has two bits, we have four different cache lines mapped to the same cache set. Figure 3 (c) shows those four cache lines in the selected memory module, and it also shows that every other memory module has four cache lines mapped to the same cache set, for a total of 32 such cache lines since there are 8 memory modules in our example. It is important to note that, at any instance of time, only one of these 4 cache lines in each memory module (32 cache lines in total) can reside in a particular cache memory. Bringing in any of the other cache lines in this group will create a conflict miss in that direct-mapped cache memory.
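To make the field layout concrete, the following C fragment is a minimal sketch (not taken from the paper) that decodes the cache set and home node of a physical address under the Figure 3 (a) layout, using the toy parameters of this example: 10-bit addresses, 2-byte lines, a direct-mapped cache equal to the page size, and 8 memory modules. The constant and function names are our own, chosen only for illustration.

#include <stdio.h>
#include <stdint.h>

/* General (page-interleaved) layout of Figure 3 (a), MSB to LSB:
 * [ upper page bits | I_n | I_s | I_b ].  Field widths follow the
 * small example in the text and are illustrative only.             */
enum { IB_BITS = 1,   /* byte offset within a 2-byte cache line          */
       IS_BITS = 4,   /* set index (cache = page size, direct-mapped)    */
       IN_BITS = 3 }; /* node index: low-order bits of the page number   */

static unsigned set_index(uint32_t addr)   /* I_s */
{
    return (addr >> IB_BITS) & ((1u << IS_BITS) - 1);
}

static unsigned home_node(uint32_t addr)   /* I_n: pages assigned round-robin */
{
    return (addr >> (IB_BITS + IS_BITS)) & ((1u << IN_BITS) - 1);
}

int main(void)
{
    /* Fixing I_n and I_s and letting the remaining 2 line-address bits
     * vary enumerates the 2^(I_l - I_s - I_n) = 4 lines of one module
     * that conflict in the same cache set.                              */
    for (uint32_t a = 0; a < (1u << 10); a += 2)
        if (home_node(a) == 0 && set_index(a) == 5)
            printf("address 0x%03x maps to set 5 of module 0\n", (unsigned)a);
    return 0;
}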

With this important observation, we can calculate the maximum number of pointers needed in the pointer pool of each memory module using the dynamic pointer scheme [18] (described in Section 2) without causing pointer overflow. In each memory module, the I_n field is fixed, only one cache line in the group specified by the (I_l - I_s - I_n) field can reside in any cache memory, and there are 2^(I_s) such groups. Therefore, we need a maximum of 2^(I_n) * 2^(I_s) pointers in each memory module. (If the cache set associativity is a, at most a cache lines in the group can reside in the cache memory, so the maximum number of pointers is a * 2^(I_n) * 2^(I_s).) This worst case happens when all of the cache memories have all of their cache lines from the same memory module. Notice that 2^(I_s) = c is the total number of cache lines in each cache memory, and 2^(I_n) = n is the total number of system nodes. The maximum number of pointers required in the pointer pool without "victimizing" any existing pointer is therefore O(cn) (or O(cn log2 n) in bits) for each memory module, which grows proportionally to the cache size. In contrast, the basic full-map directory scheme grows proportionally to the memory size, O(mn).

[Figure 4: (a) An improved data interleaving scheme on memory modules; (b) an example showing all of the cache lines (32 in each memory module) which are mapped to the same cache set in the cache memory; (c) our proposed data interleaving scheme places all 32 such cache lines in the same memory module.]

To see if we could reduce the number of needed pointers even further, let us change the interleaving scheme as follows. As shown in Figure 4 (a), we now interleave only a portion of a page, specified by the (I_s - I_n) address field, across all of the memory modules. The node number field I_n now occupies the most significant portion of the I_s field (as opposed to being on the outside of the I_s field as in Figure 3 (a)).

In this interleaving scheme, a page is now spread across all memory modules instead of being placed entirely in one memory module. Figure 4 (b) shows the same example as in Figure 3 (b), but now we have 32 cache lines in one memory module that are mapped to the same cache set in the cache memory. In a sense, the new interleaving scheme places all 32 cache lines, which used to be spread across all 8 memory modules as shown in Figure 3 (c), in the same memory module 010_2, as shown in Figure 4 (c). Similar to the earlier argument, only one of these 32 cache lines can reside in that cache set at any instance of time. Bringing in any of the other cache lines in this group will create a cache conflict miss.

Based on this observation, we can again estimate the maximum number of pointers needed in each memory module as follows. The first interesting observation is that, since the node number field I_n is part of the cache index field I_s, all of the cache lines in a particular memory module (in which the I_n field has a fixed value, i.e., its node number) can only occupy a specific portion of the cache memory (the part where the I_n field has the value of its node number). In other words, the addresses in each memory module can no longer be mapped to the entire cache memory as in the previous interleaving scheme, but only to a portion of it. The cache memory is, in effect, divided into equal portions, one for each memory module, as shown in Figure 4 (c). Because of such a reduction in the mappable cache memory space for each memory module, more cache lines from a memory module are mapped to the same cache set. Again, from Figure 4 (a), for a selected memory module and a selected cache set (in which the value of the I_s field is fixed, and I_n is a part of I_s), we have 2^(I_l - I_s) (or 2^(I_p)) cache lines in that memory module being mapped to the same cache set. This is a much larger group than the group in Figure 3 (a), but the number of groups in each memory module decreases from 2^(I_s) to 2^(I_s - I_n). The maximum number of pointers needed without having to "victimize" any pointer in each memory module is now 2^(I_n) * 2^(I_s - I_n) = 2^(I_s) = c, i.e., it is proportional to the cache size c only (as opposed to cn in the previous case). (If the set associativity is a, the maximum number of pointers is a * 2^(I_n) * 2^(I_s - I_n) = a * 2^(I_s) = c.)
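Under the proposed layout, only the position of the node field changes: I_n becomes the most significant part of I_s. The sketch below, again using the illustrative toy parameters and names of our own choosing, shows the modified home-node computation and the resulting per-module pointer bound of 2^(I_s) = c.

#include <stdint.h>

/* Proposed layout of Figure 4 (a): I_n is the top part of I_s, so lines
 * homed at module j can only occupy the j-th slice of every cache.
 * Field widths are the same toy values as in the previous sketch.       */
enum { IB_BITS = 1, IS_BITS = 4, IN_BITS = 3 };

static unsigned set_index(uint32_t addr)   /* I_s, unchanged position */
{
    return (addr >> IB_BITS) & ((1u << IS_BITS) - 1);
}

static unsigned home_node(uint32_t addr)   /* I_n = top bits of I_s */
{
    return set_index(addr) >> (IS_BITS - IN_BITS);
}

/* Upper bound on simultaneously cached lines homed at one module, i.e.
 * the pointer-pool size per module: each of the 2^(I_n) nodes can cache
 * at most 2^(I_s - I_n) of this module's lines, so the bound is
 * 2^(I_n) * 2^(I_s - I_n) = 2^(I_s) = c, independent of the node count. */
static unsigned max_pointers_per_module(void)
{
    return 1u << IS_BITS;
}

With this mapping, even if every node fills its slice of the cache with lines from the same module, at most c copies of that module's lines exist system-wide, which is exactly the pointer bound derived above.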

Let us use the Cray T3E [3] as an example to show how much reduction we could achieve using the scheme described above, assuming we organize the Cray T3E with directories. The Cray T3E can have up to 2048 processors. Each processor is a DEC Alpha chip [8] which has separate L1 8K-byte instruction and data caches, and an integrated 3-way set-associative L2 cache of 96K bytes (to simplify our discussion, we assume a direct-mapped L2 cache of 32K bytes). It can also have an optional 1M-byte to 64M-byte off-chip L3 cache (we assume a 2M-byte L3 cache). Each processor can have a memory module of up to 2G bytes. The cache line size for the L2 cache and the L3 cache can be either 32 bytes or 64 bytes. We will focus only on the L2 and L3 caches in our discussion, and assume the cache line size is 64 bytes.

Using the full-map directory scheme for a Cray T3E with 1024 nodes, we would need a 1024-bit (i.e., 128-byte) vector, 1 bit for each node, for each cache line in the memory module. For a 2G-byte memory module, there are 32M cache lines; hence, the directory size needed in each memory module is 128 bytes * 32M, which is 4G bytes. Using the page interleaving scheme, each pointer has 10 bits (approximated as 2 bytes) since there are 1024 nodes. The Cray T3E does not have virtual memory, so we assume the unit of the interleaving is 32K bytes if there is only an L2 cache (32K bytes = 2^9 lines), and 2M bytes if there is an L3 cache (2M bytes = 2^15 lines). If we ignore the other overhead needed to maintain the pointer lists, the total number of pointers needed in a memory module without overflowing the pointer pool is 2^9 * 2^10 = 2^19, which is 1M bytes of pointers if there is only an L2 cache, and 2^15 * 2^10 = 2^25, which is 64M bytes, if there is an L3 cache. Even though this is a substantial reduction in the directory size compared to the full-map directory scheme, it is still a significant hardware overhead. Using our proposed interleaving scheme, however, the number of pointers needed in the pointer pool is only 2^9 = 512 (1K bytes) and 2^15 (64K bytes), respectively, for the L2 and L3 caches. This is a very significant reduction in the pointer pool size, and it makes the proposed scheme workable even for a full-sized 1024-node Cray T3E.
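The arithmetic of this example can be checked with a few lines of C. This is only a sketch of the back-of-the-envelope calculation above, using the numbers quoted in the text (1024 nodes, 2-GB memory modules, 64-byte lines, 32-KB and 2-MB interleaving units, 2-byte pointers); it is not a model of the actual Cray T3E hardware.

#include <stdio.h>

int main(void)
{
    const long long nodes       = 1024;           /* 2^10 nodes             */
    const long long module_size = 2LL << 30;      /* 2 GB per node          */
    const long long line_size   = 64;
    const long long lines       = module_size / line_size;  /* 32M lines    */
    const long long ptr_bytes   = 2;              /* 10-bit pointer, rounded up */

    long long full_map = lines * (nodes / 8);     /* one bit per node per line */
    printf("full-map directory:       %lld MB\n", full_map >> 20);   /* 4096 */

    long long l2_lines = (32 << 10) / line_size;  /* 2^9 lines in a 32 KB L2 */
    long long l3_lines = (2 << 20)  / line_size;  /* 2^15 lines in a 2 MB L3 */

    /* cache-size (page-style) interleaving: up to c*n pointers per module  */
    printf("interleaved pointer pool: %lld KB (L2), %lld KB (L3)\n",
           (l2_lines * nodes * ptr_bytes) >> 10,   /* 1024 KB = 1 MB         */
           (l3_lines * nodes * ptr_bytes) >> 10);  /* 65536 KB = 64 MB       */

    /* proposed interleaving: only c pointers per module                    */
    printf("proposed pointer pool:    %lld KB (L2), %lld KB (L3)\n",
           (l2_lines * ptr_bytes) >> 10,           /* 1 KB                   */
           (l3_lines * ptr_bytes) >> 10);          /* 64 KB                  */
    return 0;
}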

It is very important to note that the reduction in the mappable space in each cache memory for each memory module does not alter the cache miss ratio in the cache memory. This is because, even though we use a different data interleaving scheme on the main memory modules, we do not change the original data layout within the physical address space, on which the cache data mapping is based. That is, the position of the I_s field is unchanged. The only thing that changes is the location of the data in the memory modules, not the cache behavior. However, different interleaving schemes will incur different amounts of memory conflicts in the memory modules, and may also incur extra network traffic to remote memory modules which might have been local in the original interleaving scheme. In fact, our simulations show very little difference between the two address interleaving schemes, since neither scheme has anything to do with locality of reference. We will discuss these issues using simulation results in Section 4. Nevertheless, the cache miss ratios remain unchanged in each cache memory.

[Figure 5: Other data interleaving schemes on memory modules: (a) the I_n field straddling the boundary between the I_p and I_s fields; (b) the I_n field inside the I_s field but straddling part of the I_p field; (c) the I_n field inside the L1 set index field for a two-level cache.]

Using a similar strategy, there could be many ways to interleave the data among the memory modules, each with different consequences for the total required pointers and the memory performance. One possible interleaving scheme is to straddle the I_n field between the I_p and I_s fields, as shown in Figure 5 (a). Such schemes tend to increase the required pointers, but our simulations show that they have very little impact on the overall performance (see Section 4). Another design consideration is the size of a page, which determines the length of the I_p field and its relationship to the cache size. The page size is usually smaller than the size of the L2 or L3 cache, and caches may have a set associativity larger than one. However, whatever the layout of I_p and I_s is, the size of the pointer pool depends only on the location of the I_n field relative to the I_s field. Figure 5 (b) shows the case in which the I_n field is a part of the I_s field, but straddles part of the I_p field.

For systems with multi-level cache memories, if all of the cache memories at the different levels satisfy the inclusion property, then we only need to consider the cache memory level which is farthest away from the processor, which is usually the largest in size. All of the discussion so far still applies.

However, if the inclusion property is relaxed [11, 12], then the maximum number of pointers required in each pointer pool will be the sum of the total cache lines at all levels if the I_n field is included in the I_s field of the L1 cache (I_s^1). That is, if the total numbers of cache lines in the L1 and L2 caches are c_1 and c_2, respectively, and the I_n field stays within the I_s^1 field as shown in Figure 5 (c), the maximum number of pointers needed in each pointer pool will be c_1 + c_2. Again, different interleaving unit sizes will have different consequences. However, studying all possible variations is beyond the scope of this paper.

3.1 Directory organization

[Figure 6: Directory organization: (a) each cache line in the memory module has a head link; (b) the shared pointer pool with node number and link fields and the head of free list (HFL) register.]

The management of the pointer pool used in our proposed scheme is very similar to the one proposed in [18]. The basic organization is shown in Figure 6. Each pointer has two major fields: one stores the number of the node which has a copy of the cache line, and the other keeps a link which points to the next pointer in a non-circular singly linked list (see Figure 6 (b)). A free pointer list is maintained, and its head is held in a special register called the head of free list (HFL). To facilitate access to the linked list associated with a cache line in the memory module, each cache line in the memory module could have a link (shown as the "head link" in Figure 6 (a)) which points to the first pointer of the associated linked list. The link is of size log2(c) bits, which is very small compared to the size of a cache line.

However, the total size of the head links can grow proportionally to the size of each memory module if each cache line in the memory module is associated with such a link. Even though this storage overhead is very small in practice, to avoid the extra overhead we could instead hash the address of a cache line to find the associated list in the pointer pool. This could make the pointer access a little more complicated, with some extra time overhead if conflicts occur during hashing. However, it allows the directory size to scale only with the cache memory size instead of the memory size. In fact, we can optimize the hashing function by using the fact that only a * n copies among the 2^(I_l - I_s) cache lines in a group are cached at any time (where a is the set associativity and n is the number of nodes). Thus, in the worst case, the number of hashing conflicts is limited to a * n. In Section 4, we study the effect of such pointer pool overhead on the overall performance.
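As an illustration of this organization, the following C sketch implements the pointer pool, the HFL register, and the per-line head links for one memory module, with operations to add and remove a sharer. The sizes, field widths, and function names are illustrative assumptions of ours, not the paper's hardware design.

#include <stdint.h>
#include <stddef.h>

#define POOL_SIZE 256            /* c: cache lines per cache memory        */
#define NIL       0xFFFF         /* illustrative "null link" encoding      */

struct ptr_entry {
    uint16_t node;               /* node holding a copy of the line        */
    uint16_t link;               /* next entry in the singly linked list   */
};

struct directory {
    struct ptr_entry pool[POOL_SIZE];
    uint16_t hfl;                /* head of free list (HFL) register       */
    uint16_t *head_link;         /* per-memory-line head of its list       */
};

static void dir_init(struct directory *d, uint16_t *head_links, size_t nlines)
{
    for (size_t i = 0; i + 1 < POOL_SIZE; i++)
        d->pool[i].link = (uint16_t)(i + 1);
    d->pool[POOL_SIZE - 1].link = NIL;
    d->hfl = 0;
    d->head_link = head_links;
    for (size_t i = 0; i < nlines; i++)
        d->head_link[i] = NIL;
}

/* Record that `node` now caches memory line `line` (e.g. after a read miss). */
static int dir_add_sharer(struct directory *d, size_t line, uint16_t node)
{
    if (d->hfl == NIL)
        return -1;               /* pool exhausted; with the proposed
                                    interleaving plus replacement
                                    notification this should not occur     */
    uint16_t e = d->hfl;
    d->hfl = d->pool[e].link;    /* take an entry from the free list       */
    d->pool[e].node = node;
    d->pool[e].link = d->head_link[line];
    d->head_link[line] = e;      /* push onto the line's list              */
    return 0;
}

/* Remove `node` from `line`'s list, e.g. on invalidation or replacement. */
static void dir_remove_sharer(struct directory *d, size_t line, uint16_t node)
{
    uint16_t *p = &d->head_link[line];
    while (*p != NIL) {
        uint16_t e = *p;
        if (d->pool[e].node == node) {
            *p = d->pool[e].link;          /* unlink                       */
            d->pool[e].link = d->hfl;      /* return entry to free list    */
            d->hfl = e;
            return;
        }
        p = &d->pool[e].link;
    }
}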

3.2 Directory protocol and implementation

Since our scheme is based only on the data interleaving on the memory modules, we could use many existing cache coherence protocols, for example, the protocol used in the full-map directory scheme [5]. That is, when a read request arrives at the memory module, the request is forwarded to the owner node, where the request can get a valid copy of the requested cache line. After its completion, a pointer is added to the linked list associated with the requested cache line. If it is an exclusive-read or a write operation, all of the copies pointed to by the linked list have to be invalidated (or updated, depending on the protocol used). A new pointer is added to the linked list after the completion of the invalidation (or the update).

It is pointed out in [18] that a change in the protocol for a write-back cache might be needed. The issue arises when a "clean" cache line is replaced. In a full-map directory scheme, a "clean" cache line can be replaced without informing the directory: even though the bit corresponding to the replaced "clean" cache line remains set in the directory, there is no significant side effect as long as data consistency is guaranteed. However, in the proposed scheme, the pointer corresponding to the replaced "clean" cache line becomes "stale" and is not returned to the free list. If the number of such "stale" pointers increases over time, the available free pointers will be depleted. To avoid this situation, we force all replaced cache lines, either "clean" or "dirty", to inform the directory and to free up the pointers associated with the replaced cache lines.

Another possible modification is needed when we replace a cache line. We may want to fetch the new cache line first and perform the replacement later to minimize the processor stall time. Under the proposed scheme, the replaced cache line and the line to be fetched are mapped to the same cache line, i.e., they come from the same memory module. However, if we provide only c pointers in the pointer pool as proposed (where c is the number of cache lines in each cache memory), we implicitly assume that the replaced cache line frees up its pointer before the new cache line needs one. If we reverse that order, it is possible that when the request for the new cache line arrives, there is no longer a free pointer available in the free list. Since this situation occurs only temporarily, a small number of extra pointers could be added to avoid such temporary starvation. For instance, we could add n extra pointers, assuming each processor generates only one request for a new cache line at a time. However, if the replaced copy is available at the same time the cache miss occurs, the extra pointers may not be necessary, because we could send the miss request and the replaced copy to the same home node together. Usually, the address of the replaced copy is available without increasing the cache access time. Thus, the extra network traffic needed for the replacement notification of a "clean" cache copy could be very small.

There are many ways to implement the proposed scheme. For example, because the size of the pointer pool is very small, it can be implemented in SRAM, which is much faster than the DRAM of the memory module. If the size of the head links grows proportionally to the size of each memory module (see Figure 6), they could be implemented in DRAM for a better cost/performance design. The management of the linked lists in the pointer pool (insertion and deletion of pointers) can be carried out concurrently with the management of directory requests (decoding, forming, and sending requests). The pointer overhead can thus be partially overlapped. In a write operation, since the main overhead is in invalidating (or updating) the existing cache copies in the remote cache memories, the overhead of managing the pointer pool is even less significant. We address the impact of such overhead in Section 4.
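Tying the protocol actions of this subsection to the pointer-pool sketch of Section 3.1, the fragment below shows how a home directory might drive dir_add_sharer() and dir_remove_sharer() on a read miss, a write miss, and a replacement notification. The message types and the two extern hooks are hypothetical names for illustration; the actual protocol also handles forwarding, ownership, and the non-blocking states, which are omitted here.

#include <stdint.h>
#include <stddef.h>
/* Builds on struct directory, NIL, dir_add_sharer(), and
 * dir_remove_sharer() from the Section 3.1 sketch above.                  */

enum req_type { READ_MISS, WRITE_MISS, REPLACE_NOTICE };

struct request {
    enum req_type type;
    size_t   line;               /* memory line within this module          */
    uint16_t node;               /* requesting (or replacing) node          */
};

extern void forward_to_owner(size_t line, uint16_t requester);  /* stub     */
extern void invalidate_copy(size_t line, uint16_t holder);      /* stub     */

static void directory_handle(struct directory *d, const struct request *r)
{
    switch (r->type) {
    case READ_MISS:
        forward_to_owner(r->line, r->node);     /* get a valid copy         */
        dir_add_sharer(d, r->line, r->node);    /* then record the new copy */
        break;
    case WRITE_MISS:
        /* invalidate every other copy on the line's list, freeing its
           pointers, then record the writer as the only holder              */
        while (d->head_link[r->line] != NIL) {
            uint16_t e = d->head_link[r->line];
            if (d->pool[e].node != r->node)
                invalidate_copy(r->line, d->pool[e].node);
            dir_remove_sharer(d, r->line, d->pool[e].node);
        }
        dir_add_sharer(d, r->line, r->node);
        break;
    case REPLACE_NOTICE:
        /* both clean and dirty replacements notify the home node, so
           stale pointers never accumulate and the free list is not
           depleted, as required by the discussion above                    */
        dir_remove_sharer(d, r->line, r->node);
        break;
    }
}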

Program        Description                                      Problem size
Barnes         Barnes-Hut N-body simulation                     16K particles
Cholesky       blocked sparse factorization                     d750.o
FFT            complex 1D radix-sqrt(n) six-step FFT            64K points
FMM            Fast Multipole N-body simulation                 16K particles
LU (Cont.)     blocked dense LU factorization                   matrix
Ocean (Cont.)  ocean movement simulation                        ocean
Radiosity      computation of light equilibrium distribution    room
Radix          integer radix sort                               1M integers
Raytrace       rendering a 3D scene                             teapot
Volrend        rendering a 3D volume                            head (render only)
Water-Nsqr     evaluation of a water molecule system            1000 molecules
Water-Sp       improved version of Water-Nsqr                   1000 molecules

Table 1: Characteristics of the traces.

System unit    Parameter                      Default value
system         # of processors                32
               page size                      4K bytes
               line size                      64 bytes
cache          size (per node)                16K bytes
               access time                    1 cycle
               fill time                      4 cycles
               set associativity              4-way
               replacement                    random selection
memory         line access time               24 cycles
network        service time                   30/10, 70/50, or 110/90 cycles

Table 2: Parameters in the simulation.

4 Performance Study

We use simulations to study the effect of different directory schemes on system performance, using the Splash-2 programs [22] as the workload. Some characteristics of the Splash-2 programs are listed in Table 1. We use a modified MINT [20] simulator as our front end and attach a generic DSM multiprocessor (as shown in Figure 1) with our proposed directory schemes at the back end. For fast data access latency, the cache coherence protocol used is a non-blocking directory with five cache states: invalid, shared, shared-owner, dirty, and issue [13]. This means that all subsequent requests to the same line can proceed beyond the directory.
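For reference, the five line states named above can be written down as follows. The comments paraphrase the usual roles of such states; since [13], which defines the protocol, is still in preparation, the exact transitions are not given here and this enum is only a naming sketch.

/* Line states of the non-blocking directory protocol used in the
 * simulations; names from the text, comments are our paraphrase.          */
enum line_state {
    INVALID,        /* no valid copy in this cache                          */
    SHARED,         /* read-only copy; other caches may also hold one       */
    SHARED_OWNER,   /* read-only copy; this cache supplies the data         */
    DIRTY,          /* exclusive, modified copy                             */
    ISSUE           /* miss outstanding: later requests to the same line
                       can proceed past the directory without blocking      */
};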

The default simulation parameters are shown in Table 2. There are 32 processors, with one processor in each node. Each processor can generate one missing cache line request at a time. The size of a cache line is 64 bytes, and the page size is 4K bytes. In order to reflect the working set sizes (relative to the cache memory) in our simulation, the cache size is set at 16K bytes so as to fit between the first and the second important working sets of the Splash-2 programs [22]. The cache access time is the unit of our simulation clock cycle. Four cycles are assumed for the cache fill time, and four-way set associativity is used for the cache memory. The replaced cache line is randomly selected. The memory access time is 24 cycles. Each non-memory instruction completes in a single cycle, as assumed in MINT. To simplify our simulation, we do not simulate the network topology in detail, but rather assume the communication time between nodes (the network latency or network service time) to be a constant regardless of their distance. We simulate three settings for the network latencies: 30/10 cycles, 70/50 cycles, and 110/90 cycles. In each pair of settings, the longer latency is for requests with data and the shorter latency is for requests without data. Contention at the system resources (i.e., cache, memory, directory, and network) is modeled using queues. Thus, we do simulate the memory conflicts at each memory module, since the data interleaving scheme can affect the memory access patterns.

4.1 Performance sensitivity to data interleaving schemes

We first study how different data interleaving schemes on the memory modules affect the system performance. Even though different data interleaving schemes require different directory sizes, and hence different directory processing times, we assume the directory processing time to be 16 cycles for all of the schemes in this section.

The effect of the directory processing time on the overall performance is studied in the next section. Also, instead of studying all possible data interleaving schemes, we only look at the two schemes that place the node field, I_n, at the two extreme positions in the address. The other schemes, which place I_n somewhere between these two extremes, should give results somewhere between those of these two schemes. Our simulation results actually show very little difference in the performance of the two schemes we studied.

[Figure 7: The two interleaving schemes in our simulation: (a) page interleaving, with I_n inside the I_p field; (b) line interleaving, with I_n directly above the byte offset.]

The first scheme places the I_n field inside the I_p field (see Figure 7 (a)). This corresponds to interleaving data with one page as the unit of data interleaving, i.e., page interleaving (PI). The second scheme places the I_n field outside the I_p field (see Figure 7 (b)). It corresponds to interleaving data with one cache line as the unit of data interleaving, i.e., line interleaving (LI). The page size is assumed to be 4K (or 2^12) bytes and the cache line size is 64 (or 2^6) bytes. Also, the width of I_s is 6 bits because the size of the cache is assumed to be 16K (or 2^14) bytes and the set associativity is four. The number of cache lines in a cache memory, c, is the 2^14-byte cache size divided by the 2^6-byte line size, i.e., 256. The home node (I_n) of an address is determined by bits 12 to 16 in the PI scheme, and by bits 6 to 10 in the LI scheme. This results in different numbers of pointers needed in the pointer pool: cn = 256 * 32 = 8192 for the PI scheme and c = 256 for the LI scheme, assuming 32 nodes.
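A few lines of C reproduce these numbers from the parameters of Table 2. This is only a sketch of the arithmetic, and the PI/LI labels follow the naming used here.

#include <stdio.h>

int main(void)
{
    const int nodes = 32;                  /* 2^5 nodes -> 5-bit I_n field   */
    const int c = (16 << 10) / 64;         /* 16 KB cache / 64 B lines = 256 */

    printf("PI pointer pool: %d entries\n", c * nodes);   /* cn = 8192       */
    printf("LI pointer pool: %d entries\n", c);           /* c  = 256        */

    /* The home-node field sits just above the interleaving unit: pages are
       4 KB (2^12) and lines are 64 B (2^6), so I_n occupies bits 12..16
       under page interleaving and bits 6..10 under line interleaving,
       matching the bit positions quoted above.                              */
    return 0;
}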

Our simulation results show no difference in the cache memory hit rate between the PI and LI schemes, since the data interleaving schemes only affect the placement of data in the memory modules. Although the two schemes allocate data in the memory modules differently, this makes no difference in the memory hit rates, because both schemes allocate data in the memory modules without considering reference locality. In fact, we found that the probability of finding the line requested by a cache miss is uniformly distributed among all memory modules.

[Figure 8: Waiting time per request at system resources (directory and memory) for the PI and LI schemes under the 30/10, 70/50, and 110/90 network service times.]

Figure 8 shows the average waiting time at the directory and at the memory module for both schemes with different network service times. In general, a longer network service time increases the waiting time at the network while decreasing the waiting time at the other system resources. The LI scheme has a longer average waiting time at the directory and the memory module than the PI scheme. This is because the requests for a missing cache line and a replaced cache line are sent to the same home memory module in the LI scheme; the replacement request, which follows the missing-line request, usually needs to wait for the completion of the missing-line request at the same memory module. The home directories of a missing cache line and a replaced cache line in the PI scheme are different in most cases; hence, conflicts at the directory and the memory module happen less frequently. However, since the replacement request is not on the critical path of a data access, the longer waiting time does not affect the overall system performance. Our results show very little difference in the average overall data access latency; that is, the cache miss penalty is almost the same in both schemes.

[Figure 9: Speedup of the two interleaving schemes under the different network service times.]

Figure 9 shows the speedup based on the PI (30/10) configuration, i.e., the execution time of PI (30/10) divided by the execution time of a given scheme.

The figure also shows the maximum stalled time among the processors at synchronization points (wait(), barrier(), locks, etc.). The PI and LI schemes have almost the same performance.

In the PI scheme, the maximum number of pointers needed in each pointer pool should be cn, where c is the number of cache lines in a cache memory and n is the number of nodes. Given our simulation parameters, cn is 8192 (= 256 cache lines in the cache memory * 32 nodes). However, the actual number of pointers used in the pointer pool varies over time during the program execution, and the average number of pointers used cannot reveal the usage of those pointers. In Figure 10, we therefore show the number of pointers which have already been allocated when a cache line request arrives at the directory in the PI scheme. To compare against the LI scheme, which needs a maximum of c pointers in its pointer pool (c = 256 in this example), the x-axis in the figure starts at c = 256. The y-axis represents the fraction of the requests which see fewer than x pointers in use.

[Figure 10: The number of pointers in use seen by a cache line request to the directory in the PI scheme, for (a) 16 nodes and (b) 32 nodes.]

Figure 10 (a) and (b) assume systems of 16 nodes and 32 nodes, respectively. From the figures, the number of pointers actually used is much less than the maximum. Also, the number of pointers used in a program, such as LU, does not always increase as the system size grows. The number of pointers in use depends very much on the program behavior, regardless of the system size.

From Figure 10 (b), even though the maximum number of pointers which can be used by a program is 8192, all programs use fewer than 1792 pointers; however, that number is still much larger than the 256 needed in the LI scheme.

4.2 Performance sensitivity to the directory processing times

We consider three implementation choices for the directory: the full-map directory, a scheme using a head link for each cache line in the memory module to point to the first pointer of the linked list, and a scheme using hashing of the cache line address to locate the first pointer of the linked list, designated as the full-map, head-link, and hashing schemes, respectively, in the figures. In our simulation, only the number of accesses to the directory is considered for the directory processing time. For the full-map scheme, we assume that every request can be served within one DRAM access time (8 cycles); that is, the bit vector saved in DRAM can be fetched and updated in 8 cycles. For the head-link scheme, the number of DRAM accesses per request is assumed to be two (16 cycles). For the hashing scheme, 5 DRAM accesses (40 cycles) are used as the processing time.

All three schemes have the same address mapping to the cache and the memory module; hence, there is no difference in the cache memory and memory module hit rates. Therefore, different directory schemes do not affect the number of requests to the cache memory, the memory module, the directory, or the network. Although we observe very little difference in the number of requests among the schemes, the waiting time varies as the directory processing time increases. In general, if a request stays at a resource for a longer time, there is a higher probability of conflict at that resource, which in turn reduces the probability of conflict at the other resources. The different waiting times in turn affect the average data access latency, as shown in Figure 11. The average data access latency of a cache line request includes the access latency in the local cache memory, the local memory module or a remote memory module if a node miss occurs, and a remote cache memory if it is needed. On the critical path of a data access, the waiting time caused by contention at the system resources is also included in the average latency. A longer directory processing time causes a small increase in the waiting time at the directory, while causing a large decline in the waiting time at the network. However, the overall data access latency still grows.

[Figure 11: Average data access latency, broken down into local cache hit, local memory hit, remote cache hit, remote memory hit, and waiting time components.]

Figure 12 shows the speedup based on the system which uses the full-map scheme with a network latency of 30 cycles for requests with data and 10 cycles for requests without data. The results show that the performance is sensitive to the directory processing time. Hence, the size of the pointer pool (a smaller pointer pool has a faster access time) and the management of the linked list should be kept as simple as possible. Schemes similar to the head-link scheme seem to be a good choice. However, the detailed hardware implementation of the pointer pool is beyond the scope of this paper.

5 Conclusion

Controlling the data interleaving scheme allows us to change the data layout in the main memory without altering the cache access behavior in each cache memory. We can also take advantage of the data interleaving to control the size of the directories needed to support cache coherence protocols. In this paper, we have shown that the reduction in the directory size from a well-conceived data interleaving scheme can be quite substantial.

[Figure 12: Speedup of the three directory implementations under the different network service times.]

The proposed scheme also allows the directory size to scale with the cache memory size instead of the main memory size and the system size, which is very attractive for large-scale systems. Our simulations show that the impact of the different data interleaving schemes on the system performance is usually minimal. The performance is, on the other hand, more sensitive to the directory management overhead. Hence, by selecting a suitable data interleaving scheme with an efficient pointer pool management mechanism, a directory-based cache coherence protocol can be used very cost-effectively even for very large DSM multiprocessors.

References

[1] A. Agarwal, R. Bianchini, D. Chaiken, K. L. Johnson, D. Kranz, J. Kubiatowicz, B. H. Lim, K. Mackenzie, and D. Yeung, "The MIT Alewife Machine: Architecture and Performance," Proc. of the 22nd Annual Int'l Sym. on Comp. Archi.
[2] A. Agarwal, R. Simoni, M. Horowitz, and J. Hennessy, "An Evaluation of Directory Schemes for Cache Coherence," Proc. of the 15th Annual Int'l Sym. on Comp. Archi.
[3] E. Anderson, J. Brooks, C. Grassl, and S. Scott, "Performance of the CRAY T3E Multiprocessor," Proc. of Supercomputing '97, Aug.
[4] J. Archibald and J. L. Baer, "Cache Coherence Protocols Evaluation Using a Multiprocessor Simulation Model," ACM Trans. on Comp. Sys., Vol. 4, No. 4, Nov.
[5] L. Censier and P. Feautrier, "A New Solution to Coherence Problem in Multicache Systems," IEEE Trans. Computers, Vol. C-27, No. 12, Dec. 1978.
[6] D. Chaiken, J. Kubiatowicz, and A. Agarwal, "LimitLESS Directories: A Scalable Cache Coherence Scheme," ASPLOS-IV, April.
[7] S. J. Eggers and R. H. Katz, "A Characterization of Sharing in Parallel Programs and its Application to Coherency Protocol Evaluation," Proc. of the 15th Annual Int'l Sym. on Comp. Archi., May.
[8] Digital Equipment Corp., Alpha Microprocessor Hardware Reference Manual, DEC part number EC-QP99B-TE, Feb.

[9] J. R. Goodman, "Using Cache Memory to Reduce Processor-Memory Traffic," Proc. of the 10th Annual Int'l Sym. on Comp. Archi., May 1983.
[10] D. V. James, A. T. Laundrie, S. Gjessing, and G. S. Sohi, "New Directories in Scalable Shared Memory Multiprocessor Architectures: Scalable Coherent Interface," Computer, Vol. 23, No. 6, June 1990.
[11] N. P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," Proc. of the 17th ISCA.
[12] J. Kong and G. Lee, "Relaxing the Inclusion Property in Cache Only Memory Architecture," Euro-Par'96, Vol. II, Aug. 1996.
[13] J. Kong, P.-C. Yew, and G. Lee, "A Non-blocking Directory for Large-Scale Multiprocessors," in preparation.
[14] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy, "The Stanford FLASH Multiprocessor," Proc. of the 21st Annual Int'l Sym. on Comp. Archi.
[15] J. Laudon and D. Lenoski, "The SGI Origin: A ccNUMA Highly Scalable Server," Proc. of the 24th Annual Int'l Sym. on Comp. Archi.
[16] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy, "The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor," Proc. of the 17th Annual Int'l Sym. on Comp. Archi.
[17] T. D. Lovett and R. M. Clapp, "STiNG: A CC-NUMA Computer System for the Commercial Marketplace," Proc. of the 23rd Annual Int'l Sym. on Comp. Archi.
[18] R. Simoni and M. Horowitz, "Dynamic Pointer Allocation for Scalable Cache Coherence Directories," Proc. of the Int'l Sym. on Shared Memory Multiprocessing, April 1991.
[19] P. Sweazey and A. J. Smith, "A Class of Compatible Cache Consistency Protocols and their Support by the IEEE Futurebus," Proc. of the 13th Annual Int'l Sym. on Comp. Archi.
[20] J. Veenstra and R. Fowler, "MINT: A Front-End for Efficient Simulation of Shared-Memory Multiprocessors," Proc. of the 2nd MASCOTS, Jan.-Feb.
[21] W. Weber and A. Gupta, "Analysis of Cache Invalidation Patterns in Multiprocessors," ASPLOS-III, April.
[22] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 Programs: Characterization and Methodological Considerations," Proc. of the 22nd Annual Int'l Sym. on Comp. Archi.


More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Bushra Ahsan and Mohamed Zahran Dept. of Electrical Engineering City University of New York ahsan bushra@yahoo.com mzahran@ccny.cuny.edu

More information

Multiple Context Processors. Motivation. Coping with Latency. Architectural and Implementation. Multiple-Context Processors.

Multiple Context Processors. Motivation. Coping with Latency. Architectural and Implementation. Multiple-Context Processors. Architectural and Implementation Tradeoffs for Multiple-Context Processors Coping with Latency Two-step approach to managing latency First, reduce latency coherent caches locality optimizations pipeline

More information

Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996

Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996 Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Review: Miss Rates for Snooping Protocol 4th C: Coherency Misses More processors:

More information

A MULTIPORTED NONBLOCKING CACHE FOR A SUPERSCALAR UNIPROCESSOR JAMES EDWARD SICOLO THESIS. Submitted in partial fulllment of the requirements

A MULTIPORTED NONBLOCKING CACHE FOR A SUPERSCALAR UNIPROCESSOR JAMES EDWARD SICOLO THESIS. Submitted in partial fulllment of the requirements A MULTIPORTED NONBLOCKING CACHE FOR A SUPERSCALAR UNIPROCESSOR BY JAMES EDWARD SICOLO B.S., State University of New York at Bualo, 989 THESIS Submitted in partial fulllment of the requirements for the

More information

LimitLESS Directories: A Scalable Cache Coherence Scheme. David Chaiken, John Kubiatowicz, and Anant Agarwal. Massachusetts Institute of Technology

LimitLESS Directories: A Scalable Cache Coherence Scheme. David Chaiken, John Kubiatowicz, and Anant Agarwal. Massachusetts Institute of Technology LimitLESS Directories: A Scalable Cache Coherence Scheme David Chaiken, John Kubiatowicz, and Anant Agarwal Laboratory for Computer Science Massachusetts Institute of Technology Cambridge, Massachusetts

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax:

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax: Consistent Logical Checkpointing Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 hone: 409-845-0512 Fax: 409-847-8578 E-mail: vaidya@cs.tamu.edu Technical

More information

Multiprocessor Cache Coherency. What is Cache Coherence?

Multiprocessor Cache Coherency. What is Cache Coherence? Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by

More information

Concepts Introduced in Appendix B. Memory Hierarchy. Equations. Memory Hierarchy Terms

Concepts Introduced in Appendix B. Memory Hierarchy. Equations. Memory Hierarchy Terms Concepts Introduced in Appendix B Memory Hierarchy Exploits the principal of spatial and temporal locality. Smaller memories are faster, require less energy to access, and are more expensive per byte.

More information

Lazy Release Consistency for Hardware-Coherent Multiprocessors

Lazy Release Consistency for Hardware-Coherent Multiprocessors Lazy Release Consistency for Hardware-Coherent Multiprocessors Leonidas I. Kontothanassis, Michael L. Scott, and Ricardo Bianchini Department of Computer Science University of Rochester Rochester, NY 14627-0226

More information

Shared vs. Snoop: Evaluation of Cache Structure for Single-chip Multiprocessors

Shared vs. Snoop: Evaluation of Cache Structure for Single-chip Multiprocessors vs. : Evaluation of Structure for Single-chip Multiprocessors Toru Kisuki,Masaki Wakabayashi,Junji Yamamoto,Keisuke Inoue, Hideharu Amano Department of Computer Science, Keio University 3-14-1, Hiyoshi

More information

Real-Time Scalability of Nested Spin Locks. Hiroaki Takada and Ken Sakamura. Faculty of Science, University of Tokyo

Real-Time Scalability of Nested Spin Locks. Hiroaki Takada and Ken Sakamura. Faculty of Science, University of Tokyo Real-Time Scalability of Nested Spin Locks Hiroaki Takada and Ken Sakamura Department of Information Science, Faculty of Science, University of Tokyo 7-3-1, Hongo, Bunkyo-ku, Tokyo 113, Japan Abstract

More information

is developed which describe the mean values of various system parameters. These equations have circular dependencies and must be solved iteratively. T

is developed which describe the mean values of various system parameters. These equations have circular dependencies and must be solved iteratively. T A Mean Value Analysis Multiprocessor Model Incorporating Superscalar Processors and Latency Tolerating Techniques 1 David H. Albonesi Israel Koren Department of Electrical and Computer Engineering University

More information

Design Trade-Offs in High-Throughput Coherence Controllers

Design Trade-Offs in High-Throughput Coherence Controllers Design Trade-Offs in High-Throughput Coherence Controllers Anthony-Trung Nguyen Microprocessor Research Labs Intel Corporation Santa Clara, CA 9 anthony.d.nguyen@intel.com Josep Torrellas Department of

More information

Scalable Directory Organization for Tiled CMP Architectures

Scalable Directory Organization for Tiled CMP Architectures Scalable Directory Organization for Tiled CMP Architectures Alberto Ros, Manuel E. Acacio, José M. García Departamento de Ingeniería y Tecnología de Computadores Universidad de Murcia {a.ros,meacacio,jmgarcia}@ditec.um.es

More information

Massachusetts Institute of Technology. Cambridge, MA Abstract. in Alewife, a scalable multiprocessor that is being developed at MIT.

Massachusetts Institute of Technology. Cambridge, MA Abstract. in Alewife, a scalable multiprocessor that is being developed at MIT. Latency Tolerance through Multithreading in Large-Scale Multiprocessors Kiyoshi Kurihara 3, David Chaiken, and Anant Agarwal Laboratory for Computer Science Massachusetts Institute of Technology Cambridge,

More information

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST Chapter 8. Multiprocessors In-Cheol Park Dept. of EE, KAIST Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will

More information

Lecture 11: Snooping Cache Coherence: Part II. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Lecture 11: Snooping Cache Coherence: Part II. CMU : Parallel Computer Architecture and Programming (Spring 2012) Lecture 11: Snooping Cache Coherence: Part II CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Announcements Assignment 2 due tonight 11:59 PM - Recall 3-late day policy Assignment

More information

An Efficient Tree Cache Coherence Protocol for Distributed Shared Memory Multiprocessors

An Efficient Tree Cache Coherence Protocol for Distributed Shared Memory Multiprocessors 352 IEEE TRANSACTIONS ON COMPUTERS, VOL. 48, NO. 3, MARCH 1999 An Efficient Tree Cache Coherence Protocol for Distributed Shared Memory Multiprocessors Yeimkuan Chang and Laxmi N. Bhuyan AbstractÐDirectory

More information

The Impact of Instruction-Level Parallelism on. Vijay S. Pai, Parthasarathy Ranganathan, and Sarita V. Adve. Rice University. Houston, Texas 77005

The Impact of Instruction-Level Parallelism on. Vijay S. Pai, Parthasarathy Ranganathan, and Sarita V. Adve. Rice University. Houston, Texas 77005 The Impact of Instruction-Level Parallelism on Multiprocessor Performance and Simulation Methodology Vijay S. Pai, Parthasarathy Ranganathan, and Sarita V. Adve Department of Electrical and Computer Engineering

More information

Chapter 5. Thread-Level Parallelism

Chapter 5. Thread-Level Parallelism Chapter 5 Thread-Level Parallelism Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Progress Towards Multiprocessors + Rate of speed growth in uniprocessors saturated

More information

An Evaluation of Fine-Grain Producer-Initiated Communication in. and is fairly straightforward to implement. Therefore, commercial systems.

An Evaluation of Fine-Grain Producer-Initiated Communication in. and is fairly straightforward to implement. Therefore, commercial systems. An Evaluation of Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors Hazim Abdel-Sha y, Jonathan Hall z, Sarita V. Adve y, Vikram S. Adve [ y Electrical and Computer Engineering

More information

A Quantitative Analysis of the Performance and Scalability of Distributed Shared Memory Cache Coherence Protocols

A Quantitative Analysis of the Performance and Scalability of Distributed Shared Memory Cache Coherence Protocols A Quantitative Analysis of the Performance and Scalability of Distributed Shared Memory Cache Coherence Protocols Mark Heinrich School of Electrical Engineering, Cornell University, Ithaca, NY 453 Vijayaraghavan

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

EECS 570 Lecture 11. Directory-based Coherence. Winter 2019 Prof. Thomas Wenisch

EECS 570 Lecture 11. Directory-based Coherence. Winter 2019 Prof. Thomas Wenisch Directory-based Coherence Winter 2019 Prof. Thomas Wenisch http://www.eecs.umich.edu/courses/eecs570/ Slides developed in part by Profs. Adve, Falsafi, Hill, Lebeck, Martin, Narayanasamy, Nowatzyk, Reinhardt,

More information

ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate

More information

Computer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James

Computer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James Computer Systems Architecture I CSE 560M Lecture 18 Guest Lecturer: Shakir James Plan for Today Announcements No class meeting on Monday, meet in project groups Project demos < 2 weeks, Nov 23 rd Questions

More information

Chapter 5 Thread-Level Parallelism. Abdullah Muzahid

Chapter 5 Thread-Level Parallelism. Abdullah Muzahid Chapter 5 Thread-Level Parallelism Abdullah Muzahid 1 Progress Towards Multiprocessors + Rate of speed growth in uniprocessors is saturating + Modern multiple issue processors are becoming very complex

More information

Algorithms Implementing Distributed Shared Memory. Michael Stumm and Songnian Zhou. University of Toronto. Toronto, Canada M5S 1A4

Algorithms Implementing Distributed Shared Memory. Michael Stumm and Songnian Zhou. University of Toronto. Toronto, Canada M5S 1A4 Algorithms Implementing Distributed Shared Memory Michael Stumm and Songnian Zhou University of Toronto Toronto, Canada M5S 1A4 Email: stumm@csri.toronto.edu Abstract A critical issue in the design of

More information

Multicast Snooping: A Multicast Address Network. A New Coherence Method Using. With sponsorship and/or participation from. Mark Hill & David Wood

Multicast Snooping: A Multicast Address Network. A New Coherence Method Using. With sponsorship and/or participation from. Mark Hill & David Wood Multicast Snooping: A New Coherence Method Using A Multicast Address Ender Bilir, Ross Dickson, Ying Hu, Manoj Plakal, Daniel Sorin, Mark Hill & David Wood Computer Sciences Department University of Wisconsin

More information

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 11: Cache Coherence: Part II Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Bang Bang (My Baby Shot Me Down) Nancy Sinatra (Kill Bill Volume 1 Soundtrack) It

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1. Memory technology & Hierarchy Back to caching... Advances in Computer Architecture Andy D. Pimentel Caches in a multi-processor context Dealing with concurrent updates Multiprocessor architecture In

More information

Handout 3 Multiprocessor and thread level parallelism

Handout 3 Multiprocessor and thread level parallelism Handout 3 Multiprocessor and thread level parallelism Outline Review MP Motivation SISD v SIMD (SIMT) v MIMD Centralized vs Distributed Memory MESI and Directory Cache Coherency Synchronization and Relaxed

More information

Bundling: Reducing the Overhead of Multiprocessor Prefetchers

Bundling: Reducing the Overhead of Multiprocessor Prefetchers Bundling: Reducing the Overhead of Multiprocessor Prefetchers Dan Wallin and Erik Hagersten Uppsala University, Department of Information Technology P.O. Box 337, SE-751 05 Uppsala, Sweden dan.wallin,

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Author's personal copy

Author's personal copy J. Parallel Distrib. Comput. 68 (2008) 1413 1424 Contents lists available at ScienceDirect J. Parallel Distrib. Comput. journal homepage: www.elsevier.com/locate/jpdc Two proposals for the inclusion of

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Early Experience with Profiling and Optimizing Distributed Shared Cache Performance on Tilera s Tile Processor

Early Experience with Profiling and Optimizing Distributed Shared Cache Performance on Tilera s Tile Processor Early Experience with Profiling and Optimizing Distributed Shared Cache Performance on Tilera s Tile Processor Inseok Choi, Minshu Zhao, Xu Yang, and Donald Yeung Department of Electrical and Computer

More information

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence 1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

1 Introduction Data format converters (DFCs) are used to permute the data from one format to another in signal processing and image processing applica

1 Introduction Data format converters (DFCs) are used to permute the data from one format to another in signal processing and image processing applica A New Register Allocation Scheme for Low Power Data Format Converters Kala Srivatsan, Chaitali Chakrabarti Lori E. Lucke Department of Electrical Engineering Minnetronix, Inc. Arizona State University

More information

Special Topics. Module 14: "Directory-based Cache Coherence" Lecture 33: "SCI Protocol" Directory-based Cache Coherence: Sequent NUMA-Q.

Special Topics. Module 14: Directory-based Cache Coherence Lecture 33: SCI Protocol Directory-based Cache Coherence: Sequent NUMA-Q. Directory-based Cache Coherence: Special Topics Sequent NUMA-Q SCI protocol Directory overhead Cache overhead Handling read miss Handling write miss Handling writebacks Roll-out protocol Snoop interaction

More information

III Data Structures. Dynamic sets

III Data Structures. Dynamic sets III Data Structures Elementary Data Structures Hash Tables Binary Search Trees Red-Black Trees Dynamic sets Sets are fundamental to computer science Algorithms may require several different types of operations

More information

An Adaptive Sequential Prefetching Scheme in Shared-Memory Multiprocessors

An Adaptive Sequential Prefetching Scheme in Shared-Memory Multiprocessors An Adaptive Sequential Prefetching Scheme in Shared-Memory Multiprocessors Myoung Kwon Tcheun, Hyunsoo Yoon, Seung Ryoul Maeng Department of Computer Science, CAR Korea Advanced nstitute of Science and

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems Multiprocessors II: CC-NUMA DSM DSM cache coherence the hardware stuff Today s topics: what happens when we lose snooping new issues: global vs. local cache line state enter the directory issues of increasing

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Appears in the proceedings of the ACM/IEEE Supercompuing 94

Appears in the proceedings of the ACM/IEEE Supercompuing 94 Appears in the proceedings of the ACM/IEEE Supercompuing 94 A Compiler-Directed Cache Coherence Scheme with Improved Intertask Locality Lynn Choi Pen-Chung Yew Center for Supercomputing R & D Department

More information

Relative Reduced Hops

Relative Reduced Hops GreedyDual-Size: A Cost-Aware WWW Proxy Caching Algorithm Pei Cao Sandy Irani y 1 Introduction As the World Wide Web has grown in popularity in recent years, the percentage of network trac due to HTTP

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Speeding-up Synchronizations in DSM Multiprocessors

Speeding-up Synchronizations in DSM Multiprocessors Speeding-up Synchronizations in DSM Multiprocessors A. de Dios 1, B. Sahelices 1, P. Ibáñez 2, V. Viñals 2, and J.M. Llabería 3 1 Dpto. de Informática. Univ. de Valladolid agustin,benja@infor.uva.es 2

More information

Performance Evaluation and Cost Analysis of Cache Protocol Extensions for Shared-Memory Multiprocessors

Performance Evaluation and Cost Analysis of Cache Protocol Extensions for Shared-Memory Multiprocessors IEEE TRANSACTIONS ON COMPUTERS, VOL. 47, NO. 10, OCTOBER 1998 1041 Performance Evaluation and Cost Analysis of Cache Protocol Extensions for Shared-Memory Multiprocessors Fredrik Dahlgren, Member, IEEE

More information

The Impact of Write Back on Cache Performance

The Impact of Write Back on Cache Performance The Impact of Write Back on Cache Performance Daniel Kroening and Silvia M. Mueller Computer Science Department Universitaet des Saarlandes, 66123 Saarbruecken, Germany email: kroening@handshake.de, smueller@cs.uni-sb.de,

More information

Managing Off-Chip Bandwidth: A Case for Bandwidth-Friendly Replacement Policy

Managing Off-Chip Bandwidth: A Case for Bandwidth-Friendly Replacement Policy Managing Off-Chip Bandwidth: A Case for Bandwidth-Friendly Replacement Policy Bushra Ahsan Electrical Engineering Department City University of New York bahsan@gc.cuny.edu Mohamed Zahran Electrical Engineering

More information

A Memory Management Architecture for a Mobile Computing Environment

A Memory Management Architecture for a Mobile Computing Environment A Memory Management Architecture for a Mobile Computing Environment Shigemori Yokoyama, Takahiro Okuda2, Tadanori Mizuno2 and Takashi Watanabe2 Mitsubishi Electric Corp. Faculty of nformation, Shizuoka

More information

Fairlocks - A High Performance Fair Locking Scheme

Fairlocks - A High Performance Fair Locking Scheme Fairlocks - A High Performance Fair Locking Scheme Swaminathan Sivasubramanian, Iowa State University, swamis@iastate.edu John Stultz, IBM Corporation, jstultz@us.ibm.com Jack F. Vogel, IBM Corporation,

More information

Lecture 13. Shared memory: Architecture and programming

Lecture 13. Shared memory: Architecture and programming Lecture 13 Shared memory: Architecture and programming Announcements Special guest lecture on Parallel Programming Language Uniform Parallel C Thursday 11/2, 2:00 to 3:20 PM EBU3B 1202 See www.cse.ucsd.edu/classes/fa06/cse260/lectures/lec13

More information

Memory Consistency and Multiprocessor Performance

Memory Consistency and Multiprocessor Performance Memory Consistency Model Memory Consistency and Multiprocessor Performance Define memory correctness for parallel execution Execution appears to the that of some correct execution of some theoretical parallel

More information

SCHEDULING REAL-TIME MESSAGES IN PACKET-SWITCHED NETWORKS IAN RAMSAY PHILP. B.S., University of North Carolina at Chapel Hill, 1988

SCHEDULING REAL-TIME MESSAGES IN PACKET-SWITCHED NETWORKS IAN RAMSAY PHILP. B.S., University of North Carolina at Chapel Hill, 1988 SCHEDULING REAL-TIME MESSAGES IN PACKET-SWITCHED NETWORKS BY IAN RAMSAY PHILP B.S., University of North Carolina at Chapel Hill, 1988 M.S., University of Florida, 1990 THESIS Submitted in partial fulllment

More information

a process may be swapped in and out of main memory such that it occupies different regions

a process may be swapped in and out of main memory such that it occupies different regions Virtual Memory Characteristics of Paging and Segmentation A process may be broken up into pieces (pages or segments) that do not need to be located contiguously in main memory Memory references are dynamically

More information

The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor

The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 6, June 1994, pp. 573-584.. The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor David J.

More information

Virtual Memory. Chapter 8

Virtual Memory. Chapter 8 Chapter 8 Virtual Memory What are common with paging and segmentation are that all memory addresses within a process are logical ones that can be dynamically translated into physical addresses at run time.

More information

Memory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB

Memory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB Memory Consistency and Multiprocessor Performance Adapted from UCB CS252 S01, Copyright 2001 USB 1 Memory Consistency Model Define memory correctness for parallel execution Execution appears to the that

More information