A Study of the Efficiency of Shared Attraction Memories in Cluster-Based COMA Multiprocessors


Anders Landin and Mattias Karlgren
Swedish Institute of Computer Science
Box 1263, Kista, Sweden

Abstract

The performance of a COMA multiprocessor depends greatly on the efficiency of its large node caches, the attraction memories. When more than one processor shares an attraction memory, its behavior changes. From experiments with program-driven simulation we have found that clustering may improve the performance of the attraction memory significantly: traffic is reduced, and the miss rates are lower for shared attraction memories. However, clustering may introduce contention for the attraction memory that can ruin any potential performance gain from the increased attraction memory hit rate. Provided enough local bandwidth, application execution can remain efficient at higher memory pressure in clustered systems than in systems with single-processor nodes. At very high memory pressure some applications change behavior and start to suffer from clustering. This is caused by conflict misses due to the relatively lower associativity of the shared attraction memory.

1. Introduction

It is popular to build shared-memory multiprocessors using nodes with a few processors on a shared bus as building blocks. This is attractive since modules with two or four high-end microprocessors are readily available on the market and are often used in workstations or high-end PCs. Clustering a couple of processors together in each node is also a way to amortize the overhead of the node controller and network interface over more processors, resulting in a machine with lower overall overhead than one with a single processor per node. In addition, the network becomes simpler, since the number of nodes is reduced for a machine with a given number of processors. For the same reason, clustering may also reduce the overhead of managing directory information.
The aim of this study is to understand how the efficiency of the attraction memory in a cache-only memory architecture (COMA) multiprocessor is affected when it is shared by several processors. As the study will show, there are factors that may both improve and degrade the behavior of shared attraction memories. Although sharing attraction memories between processors was proposed long ago [6], this is the first study that actually addresses the efficiency effects of the sharing. Program-driven simulation of the programs in the SPLASH-2 application suite [17] is used to drive a simulation model of a cluster-based memory system. The performance of shared attraction memories is measured for two- and four-fold clustering at varying degrees of memory overhead. There is also an opportunity to compare our observations of sharing in COMAs with a few recent papers that address cache sharing. Clustering with cache sharing has been studied by Erlichson, Nayfeh et al. in [3, 15] and by Bennett et al. in [1]. Part of the results in [3] actually apply to COMAs (with infinite caches), and the results they report are consistent with our observations. However, the previous studies focus on UMAs and NUMAs and miss interesting properties of clustering in COMAs that this paper and [10] point out.

1.1. Organization

The rest of the paper is organized as follows. Section 2 briefly presents the key issues for COMA architectures and the behavior of the attraction memories; the expected benefits and drawbacks of clustering are also briefly discussed. In section 3 the experimental methodology is described. The results are presented and discussed in section 4. Related work is covered in section 5. Finally, the paper is concluded in section 6.

[Figure 1. A typical clustered node with four processors sharing an attraction memory: four processors with second-level caches (SLCs) connect to a node controller with attraction memory state and tag storage, to the DRAM part of the attraction memory, and to the interconnect.]

2. COMA Background

In COMA multiprocessors [6, 5], all memory is used to form large node caches called attraction memories. Since the entire memory is cache, data can migrate dynamically and live close to the processor that currently accesses it. It is also possible to replicate data that is shared among several processors. In a cluster-based COMA, the processors in a cluster share the same attraction memory. A sketch of a node with a shared attraction memory is shown in figure 1, where four processors with private lower-level caches share the attraction memory.

In contrast to conventional caches, attraction memories have to treat replacements with special care. Since there is no backup main memory, an evicted cache line must be sent to another attraction memory to ensure that the data is not lost. The replacement behavior is a key factor in the trade-off between performance and memory utilization in COMA systems [8].

An important parameter for an execution on a COMA machine is the memory pressure [16]: the relation between the size of the working set of the application and the aggregate size of the attraction memories. The memory pressure (MP) is defined as:

    MP = application working set size / total attraction memory size

At high memory pressure, most of the attraction memories are filled with unique data; at low memory pressure there is much cache space left for replication. The memory pressure can be adjusted dynamically by the operating system by paging data in and out between the attraction memories and disk. By selecting the physical addresses for new pages, the OS can set different memory pressures for different regions of the attraction memories.
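The memory pressure relation above can be illustrated with a short sketch (the sizes used here are hypothetical, chosen only to show the calculation, not values from the paper's experiments):

```python
def memory_pressure(working_set_bytes, am_size_per_node_bytes, num_nodes):
    """MP = application working set size / total attraction memory size."""
    total_am = am_size_per_node_bytes * num_nodes
    return working_set_bytes / total_am

# Hypothetical example: a 50 MB working set on 16 nodes with 6.25 MB
# of attraction memory each gives a memory pressure of 50%.
mp = memory_pressure(50 * 2**20, 6.25 * 2**20, 16)
print(f"{mp:.0%}")  # 50%
```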
To reduce memory overhead and/or to execute efficiently even when the application working set fills most of the available memory, it is desirable to run efficiently at relatively high memory pressures. However, at high memory pressure replication is very limited, which leads to frequent replacement of data. In a UMA or NUMA machine, replacement results in increased traffic and, of course, in cache misses when the evicted data is accessed again. In a COMA, the effects of replacement may be even worse. When the last copy of a datum is replaced, it must be inserted into another attraction memory to prevent it from being lost. This may cause invalidation of a shared cache line in the receiving node; thus the hit rate may decrease in both nodes. At lower memory pressure, it is likely that an invalid cache line is found or that the space of a shared cache line can be reused, but at high memory pressure very few cache lines are shared, and replacements of exclusive entries, which have to be relocated to other nodes, are frequent. This leads to significant increases in communication and impacts on execution time, removing much of the potential performance benefit offered by COMA over NUMA and UMA systems. The effects of this have been studied in [11, 2].

2.1. Shared attraction memories

Depending on the behavior of the application program, we can expect either improvements or degradations of efficiency as the number of processors that share an attraction memory grows. As stated above, this has been covered for normal caches in [3, 1]. We can expect that clustering improves performance for coherence misses, since a writing and a subsequently reading processor may be located in the same node. It may also reduce the coherence and cold miss rates: if two reading processors belong to the same node, the first processor may prefetch data for the second.
Furthermore, the space needed for replication of data can be shared between several processors, provided that they share the same data, thus reducing the need for extra replication space. This is valuable at higher memory pressures, when replication space is scarce. On the other hand, we could expect a degradation of performance if, for example, one of the processors in a cluster makes accesses with very poor locality, causing replacement of data used by the other processors in the cluster. Naturally, we can also expect a higher demand for bandwidth within each cluster as the number of processors in the cluster is increased. This may lead to performance degradation due to contention for the resources in the node.

3. Experimental methodology

The experiments have been done using the SimICS simulation system [13, 14]. SimICS is an efficient instruction

set processor simulator that simulates multiple processors executing the SPARC V8 instruction set. Memory accesses performed by the processors are fed to a memory system simulator which models the memory architecture. In the results presented below, the simulations model both shared and private data accesses, but not the instruction fetches made by the processors. Instructions are instead assumed to always hit in the cache. This is a reasonable assumption for the applications modeled, since their instruction footprints are all relatively small. All ordinary data accesses as well as synchronization accesses have been modeled. Data pages are allocated consecutively on demand, as they are accessed by the processors. Allocation of a page is done instantly, without any delay for the processor. No instruction pages are allocated.

3.1. COMA architecture model

When evaluating different degrees of clustering, we assume that the amount of attraction memory per processor is constant. That means that the attraction memory in a node with two processors is twice the size of an attraction memory in a one-processor node, etc. The number of processors in the machine is held constant in all the simulations. The architecture modeled is the Bus-based COMA [11], which is a COMA with snooping attraction memories. Coherence is maintained with an invalidation-based coherence protocol with four states per cache line (Exclusive, Owner, Shared and Invalid). It uses an accept-based replacement strategy: upon replacement of a cache line in state Exclusive or Owner, a snooping-based mechanism is used to find a receiving node that can store the replaced cache line without causing further avalanching replacements. When choosing which local line to replace, entries in state Shared are prioritized over entries in the Owner and Exclusive states.
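The victim-selection priority just described can be sketched as follows (our illustration of the stated priority order, not the simulator's code; the numeric cost ranking is an assumption used only to encode the preference):

```python
# Hypothetical eviction-cost ranking: Invalid slots are free, Shared
# copies can simply be dropped, while Owner/Exclusive lines hold data
# that must be relocated to another attraction memory.
PRIORITY = {"Invalid": 0, "Shared": 1, "Owner": 2, "Exclusive": 2}

def choose_victim(set_states):
    """Pick the index of the line in a set that is cheapest to evict."""
    return min(range(len(set_states)), key=lambda i: PRIORITY[set_states[i]])

print(choose_victim(["Exclusive", "Owner", "Shared", "Exclusive"]))  # 2
```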
When choosing a receiver for the replacement, nodes with Invalid entries are prioritized over those with Shared entries.

Throughout this study, we have simulated systems with 16 processors. The system configuration has been varied between 1, 2 and 4 processors per node, yielding 16-, 8- and 4-node systems respectively. The first-level cache has been 4 kbyte direct-mapped for all experiments. Each processor also has its own second-level cache. Since the relation between working set and cache sizes is critical to the results, we have tried to match those of real systems with larger working sets. The size of each second-level cache has been set to 1/128th of the total working set allocated by the application. The attraction memory size has been varied to achieve a memory pressure variation. The memory pressures used have been 6%, 50%, 75%, 81% and 87%. This means that a single copy of the working set entirely fills 1, 8, 12, 13 and 14 of the 16 attraction memories in a 16-node machine. This methodology results in odd cache sizes but has the advantage that it allows the application working set to be constant throughout the experiments. For each memory pressure, the attraction memory size per processor has been held constant; a node with 2 processors thus has an attraction memory twice the size of a single-processor node's. Processes have been assigned to processors in sequential order, so that processes created after each other are likely to belong to the same cluster. Thus, if the application has trivial communication locality, this can be exploited in the clusters. The cache line size has been held at 64 bytes, and the attraction memory has been 4-way set-associative for all experiments, except where explicitly specified.
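The sizing methodology above can be summarized in a small sketch (a hypothetical helper, not code from the study; it only restates the arithmetic: second-level cache is 1/128th of the working set, total attraction memory follows from the target memory pressure, and attraction memory per processor is held constant):

```python
def sizes_for(working_set_mb, mp, processors=16, procs_per_node=1):
    """Return (second-level cache MB per processor, AM MB per node)."""
    slc_mb = working_set_mb / 128           # SLC is 1/128th of the working set
    total_am_mb = working_set_mb / mp       # from MP = working set / total AM
    am_per_proc = total_am_mb / processors  # AM per processor is constant
    return slc_mb, am_per_proc * procs_per_node

# The 50 MB FFT working set at 81% MP (= 13/16), with 4-processor nodes:
slc, am_node = sizes_for(50, 13 / 16, procs_per_node=4)
print(slc, am_node)  # a 4-processor node gets 4x the AM of a 1-processor node
```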
Due to implementation constraints, it is generally not considered realistic to increase the associativity much beyond this in a real system.

3.2. Timing model

The memory system simulator models contention effects for the node controllers, attraction memory DRAMs, second-level caches and the shared bus. The processors are 4-way superscalar and run at 250 MHz, yielding a cycle time of 4 ns. The contention-less access times for reads that hit in the memory hierarchy are: first-level cache, 0 ns; second-level cache, 32 ns; attraction memory, 148 ns, where 24 ns are spent in the node controller/state & tag memory and 100 ns are spent reading data from the DRAM; remote access, 332 ns, of which the shared bus is occupied 2 times 20 ns. A release consistency model [4] with a 10-entry write buffer has been assumed. No pipeline effects or other stalls have been modeled; the processors execute 4 instructions of any kind per cycle but stall on read misses.

3.3. Applications

We have used the programs in the SPLASH-2 application suite [17] to drive the simulator. Table 1 shows the applications used and the corresponding working set sizes. Measurements have been done on the parallel section of the execution, as recommended in the SPLASH-2 codes. Note that for three of the applications (LU, Ocean and Water) SPLASH-2 contains two versions, one original and one with improved locality. We have included both versions in the study since they behave quite differently.

4. Results

We start by looking at the sharing effects in large attraction memories where the memory pressure is low. Then we study what happens when the memory pressure is increased. Following that, we see how this impacts the execution time on the bus-based machine model.

Table 1. Applications and working sets

    Application  Description                                    Problem               Working set (MB)
    Barnes       N-body                                         part.                 3.5
    Cholesky     Sparse matrix factorization                    tk29.o                40.5
    FFT          1-dim. six-step FFT                            1 M data points       50
    FMM          N-body                                         two cluster, part.    29
    LU cont      Blocked LU-fact., enhanced locality            512*
    LU non       Blocked LU-factorization                       512*
    Ocean cont   Ocean movement simul., enhanced locality       258*
    Ocean non    Ocean movement simulation                      258*
    Radiosity    Light distribution                             -room -batch          29
    Radix        Integer sorting                                2 M keys, radix
    Raytrace     Hierarchical ray tracing                       car.env -a1           36
    Volrend      3-D volume rendering                           256*256*126 vx head   22.5
    Water n2     Molecular dyn. N-body, O(n2)                   512 mol.              1
    Water sp     Molecular dyn. N-body, O(n), larger data str.  512 mol.              1.7

[Figure 2. Read node miss rate at low memory pressure for 2- and 4-way clustering, relative to 1-processor node miss rates, for all 14 applications.]

4.1. Coherence misses

A good measure of attraction memory efficiency is the read node miss rate, RNMr: the fraction of all reads the processors perform that result in node misses. To compare the efficiency of 2- and 4-fold clustering, we study the relative RNMr, i.e. the RNMr of a clustered system divided by the RNMr of a non-clustered system. If the memory pressure is sufficiently low, the only misses that occur in a COMA are coherence misses and cold misses. We study the behavior at 6% (= 1/16) memory pressure. Here, the caches are effectively infinite, since the entire working set fits in each attraction memory, and thus no replacements occur. In figure 2 the relative RNMr is shown for clustering with 2 and 4 processors per node. The upper group of bars shows the read node miss rate for 2-processor clusters relative to the RNMr of single-processor nodes; the lower half of the graph shows the corresponding figures for 4-processor clusters.
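The RNMr metric defined above is a simple ratio; a short sketch makes the relative form explicit (the miss counts here are hypothetical, chosen only to illustrate the calculation):

```python
def rnmr(node_misses, total_reads):
    """Read node miss rate: fraction of all reads that miss in the node."""
    return node_misses / total_reads

def relative_rnmr(clustered, non_clustered):
    """RNMr of a clustered system divided by that of a non-clustered one."""
    return clustered / non_clustered

# Hypothetical counts: single-processor nodes miss on 50 of 10,000 reads,
# 4-processor clusters on 31 of 10,000 reads.
rel = relative_rnmr(rnmr(31, 10_000), rnmr(50, 10_000))
print(f"{rel:.0%}")  # 62%, the average reported below for 4-fold clustering
```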
It can be seen that clustering reduces the RNMr significantly for all applications. We can also see that there is a strong correspondence between the 2-fold and 4-fold clustering gains. The average relative read node miss rate is 82% for 2-fold clustering and 62% for 4-fold clustering.

4.2. Increasing the memory pressure

It is not realistic to expect that COMAs will normally run at only 6% memory pressure. It is always desirable to reduce the memory overhead and run applications with working sets close to what fits in the physical memory. The performance implications of this have been studied in [8, 11, 2]. Here we will study how the efficiency of clustering changes as the memory pressure is increased. To understand what happens, we examine the traffic on the global shared bus. Figure 3 shows the traffic for the eight applications where clustering is consistently effective. Ten bars are shown for each application: the first five show the traffic for one-processor nodes at 6%, 50%, 75%, 81% and 87% MP, and the last five show the traffic for 4-processor nodes at the same MPs. Each bar is divided into three segments, representing read, write and replacement traffic. The read traffic reflects the read node misses from above. As expected, no replacements are made at 6% MP, while an increasing memory pressure leads to growing traffic, especially for reads and replacements. As can be seen, the 4-processor nodes consistently show lower global traffic. The situation is slightly different for the remaining six applications. Up to 81% MP the behavior is the same as for the applications above. However, at 87% MP clustering

no longer reduces traffic efficiently. For four of the applications the traffic is even higher in the clustered case.

[Figure 3. Traffic for 1- and 4-processor nodes at 6%, 50%, 75%, 81% and 87% memory pressure, split into read, write and replacement traffic, for Cholesky, FFT, LU non, Ocean cont, Ocean non, Radix, Water n2 and Water sp.]

[Figure 4. Traffic for 1- and 4-processor nodes at 6%, 50%, 75%, 81% and 87% memory pressure, with additional 8-way, 87% MP bars, for Barnes, FMM, LU cont, Radiosity, Raytrace and Volrend.]

Figure 4 shows the traffic for these six applications. The format is the same as above, with the addition of two bars per application. The new bars show traffic at 87% MP for 1- and 4-fold clustering with an attraction memory with 8-way associativity instead of the default 4-way; they are placed just after the corresponding normal 87% bars. Except for LU cont, the figure shows clearly that the reason for the dramatic traffic increase at high memory pressure for these applications is conflict misses in the attraction memory. While clustering significantly reduces the sensitivity to these misses up until 81% MP for all the applications, at 87% MP clustering can no longer help. Instead, conflict miss traffic is increased for Radiosity, Raytrace and Volrend. For LU cont the AM associativity explains only part of the increase, since the traffic for 8-way AMs is also increased by clustering. It is interesting to note that for single-processor nodes with 4-way associative attraction memories, above 76.5% MP (49/64) there is no longer space to replicate a cache line over all the 16 nodes, while 8-way associativity moves this threshold to 88.2% MP (113/128). With four-processor clusters, the corresponding levels are 81.25% MP (13/16) and 90.6% MP (29/32). This is a likely explanation of the observed effects for conflict misses.
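The thresholds quoted above follow from a counting argument: replicating a line in every node requires a free way in the corresponding set of every node, which is impossible once more than nodes*(assoc-1) of the nodes*assoc ways per set hold unique data. A small sketch of that arithmetic (our reconstruction of the reasoning, not code from the paper):

```python
from fractions import Fraction

def replication_limit(nodes, assoc):
    """First memory pressure at which some node's set must be completely
    full of unique data, so a cache line can no longer be replicated in
    all nodes: one more way than nodes*(assoc-1), out of nodes*assoc."""
    return Fraction(nodes * (assoc - 1) + 1, nodes * assoc)

print(replication_limit(16, 4))  # 49/64   (76.5% MP: 1-processor nodes, 4-way)
print(replication_limit(16, 8))  # 113/128 (88.2% MP: 1-processor nodes, 8-way)
print(replication_limit(4, 4))   # 13/16   (81.25% MP: 4-processor clusters)
print(replication_limit(4, 8))   # 29/32   (90.6% MP: 4-processor clusters, 8-way)
```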
A way to overcome this limitation is to break the inclusion in the cache hierarchy, as studied in [9, 2].

4.3. Performance impact

We choose a memory pressure of 50% as the baseline for the execution time measurements. A lower memory pressure leads to considerable memory overhead while only marginal performance gains can be made (FFT is the most sensitive application, and its execution time is improved by 4.2% when going down to 6% MP). The performance impact of clustering depends on several factors. Most important is attraction memory bandwidth. This is natural, since the aim in a COMA is to maximize the likelihood that accesses can be satisfied within the node. A single processor loads the attraction memory much less than 2 or 4 processors do. It is therefore of prime importance that the nodes are designed to tolerate the increased attraction memory load. In the original configuration, where the DRAM is occupied 100 ns per AM access, 5 of the applications show significant performance degradation for 4-way clustering at 50% memory pressure. If the DRAM bandwidth is doubled (while the latency is held constant), three applications still show a significant performance degradation: LU non (17.8% slower), Radix (12.7%) and Ocean non (5.5%). Six applications are slightly improved, while the remaining five are unaffected.

[Figure 5. Execution time for 1-way clustering at 50 and 81% MP and for 4-way clustering at 81% MP, divided into busy, SLC stall, AM stall and remote stall time, for all 14 applications.]

Figure 5 shows the execution time for the applications on the simulated machine with the doubled DRAM bandwidth. For each application three bars are shown, representing the execution time for single-processor nodes at 50 and 81% MP and for 4-processor nodes at 81% MP. Each bar is divided into four sections. Busy indicates that the processor is executing instructions or performing memory accesses that hit in the first-level cache. SLC stall represents the time spent waiting for accesses that hit in the second-level cache. Similarly, AM stall and Remote stall show the time spent waiting for accesses that hit and miss in the attraction memory. Here the performance impact of clustering is clearly illustrated. Except for LU non and Radix, which suffer most from node contention, the performance at 81% MP is significantly better for all the applications with significant remote stall time.
For Water not much can be done, since it already spends almost all its time inside the node; still, the remote stall time is reduced significantly. We can see that for many of the applications, clustering removes the performance penalty that resulted from the memory pressure increase from 50 to 81%. If the DRAM bandwidth is doubled again and the node controller gets twice the default bandwidth, all applications except the non-optimized LU show similar or better performance with 4-way clustering than with single-processor nodes, even at 50% memory pressure. Similarly, if the global bus bandwidth is halved, clustering becomes even more efficient, since the penalty for remote accesses is increased. The effects of this are largest for Barnes, FFT and LU non. Yet, the overall results are very similar to figure 5.

5. Related work

This work starts off from the observations in [3] regarding basic clustering effects. The evaluation of the performance impact of clustering with infinite caches in [3] is confirmed by our results. However, the conclusions in [3] apply to UMAs and NUMAs, while the real effects for COMAs are not addressed. They conclude that clustering is only effective for small caches (since that is where the effects are largest for caches that are backed up by real memory). This line is carried further in [15]. Our work comes to very different results regarding the effects of clustering in COMAs. In [1], Bennett et al. study a system with many similarities to [3]; again, the architecture addressed is a NUMA-like system. In [8, 9], Joe and Hennessy address the issue of memory overhead due to replication space requirements in COMA architectures. They study increased associativity as a way to reduce the effects of high memory pressure. Memory utilization in COMA has been further studied in [7, 12, 11, 2]. This is the first study to provide understanding of the real behavior of shared COMA attraction memories.
A more detailed analysis of the issues addressed in this paper is given in [10].

6. Conclusions

We have studied the effects of clustering in COMA multiprocessors. Through experiments with program-driven simulation of the applications in the SPLASH-2 suite, we have found that clustering may improve the efficiency of the attraction memories significantly: traffic is reduced, and the miss rates are lower for shared attraction memories. When the memory pressure is low, the impact of the reduced miss rate is marginal for most applications, due to the low communication required for coherence traffic. As the memory pressure is increased, large gains can be made from clustering. The reduced need for replication space means that application execution can remain efficient at higher memory pressure in clustered systems than in systems with single-processor nodes. This results in reduced memory overhead and/or higher performance. A necessary requirement is that the attraction memory has enough bandwidth to efficiently serve the processors in the cluster. Of the 14 applications we studied, all but one showed improved performance for four-way clustering at 81% memory pressure; the remaining application was dominated by intra-node contention. All applications showed significant reductions in traffic due to clustering up to 81% memory pressure. When the memory pressure was increased further, to very high levels, some applications changed behavior and started to suffer from clustering. We found that this was caused by conflict misses due to the relatively lower associativity of the shared attraction memory.

7. Acknowledgments

We wish to thank Peter Magnusson and Bengt Werner for cooperation and help with the SimICS simulator. SICS is sponsored by Ericsson AB, Telia AB, Celsius Tech Systems AB, IBM Svenska AB, Sun Microsystems AB, the Swedish Defense Material Administration (FMV), Swedish State Railroads (SJ) and the National Board for Industrial and Technical Development (Nutek). This work was partly funded by the European Commission under Esprit project SODA.
References

[1] Bennett, J.K., Fletcher, K.E. and Speight, W.E., Classification of Data Accesses to Shared Remote Data Caches in Cluster Multiprocessors, Rice ELEC TR 9502, Rice University.
[2] Dahlgren, F. and Landin, A., Reducing the Replacement Overhead in Bus-Based COMA Multiprocessors, in Proceedings of HPCA-3, IEEE, February 1997.
[3] Erlichson, A., Nayfeh, B.A., Singh, J.P. and Olukotun, K., The Benefits of Clustering in Shared Address Space Multiprocessors: An Applications-Driven Investigation, in Proceedings of Supercomputing '95, 1995.
[4] Gharachorloo, K., Lenoski, D., Laudon, J., Gibbons, P., Gupta, A. and Hennessy, J., Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors, in Proceedings of ISCA-17, ACM, May 1990.
[5] Hagersten, E., Toward Scalable Cache Only Memory Architectures, Ph.D. Thesis, SICS Dissertation Series 08, 1992.
[6] Hagersten, E., Landin, A. and Haridi, S., DDM - A Cache-Only Memory Architecture, IEEE Computer, September 1992.
[7] Jamil, S. and Lee, G., Unallocated Memory Space in COMA Multiprocessors, in Proceedings of the 8th Int'l Conf. on Parallel and Distributed Computing Systems, September 1995.
[8] Joe, T. and Hennessy, J.L., Evaluating the Memory Overhead Required for COMA Architectures, in Proceedings of the 21st Int'l Symposium on Computer Architecture, April 1994.
[9] Joe, T., COMA-F: A Non-Hierarchical Cache-Only Memory Architecture, Ph.D. Dissertation, Stanford University, March.
[10] Karlgren, M., Performance Characterization of Shared Attraction Memories in Cluster-Based COMA Multiprocessors, Master's Thesis, SICS Research Report, February.
[11] Landin, A. and Dahlgren, F., Bus-Based COMA - Reducing Traffic in Shared-Bus Multiprocessors, in Proceedings of HPCA-2, IEEE, February 1996.
[12] Lee, G. and Jamil, S., Memory Block Relocation in Cache-Only Memory Multiprocessors, in Proceedings of the 7th IASTED-ISMM Int'l Conf. on Parallel and Distributed Computing and Systems, October 1995.
[13] Magnusson, P. and Werner, B., Efficient Memory Simulation in SimICS, in Proceedings of the 28th Annual Simulation Symposium, 1995.
[14] Magnusson, P., The SimICS Home Page.
[15] Nayfeh, B.A., Olukotun, K. and Singh, J.P., The Impact of Shared-Cache Clustering in Small-Scale Shared-Memory Multiprocessors, in Proceedings of HPCA-2, IEEE, February 1996.
[16] Saulsbury, A., Wilkinson, T., Landin, A. and Carter, J., An Argument for Simple COMA, in HPCA-95. Presentation slides: ans/simple-coma/hpca-95/hpca-pres.ps
[17] Woo, S.C., Ohara, M., Torrie, E., Singh, J.P. and Gupta, A., The SPLASH-2 Programs: Characterization and Methodological Considerations, in Proceedings of ISCA-22, ACM, June 1995.


More information

Latency Hiding on COMA Multiprocessors

Latency Hiding on COMA Multiprocessors Latency Hiding on COMA Multiprocessors Tarek S. Abdelrahman Department of Electrical and Computer Engineering The University of Toronto Toronto, Ontario, Canada M5S 1A4 Abstract Cache Only Memory Access

More information

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Magnus Ekman Per Stenstrom Department of Computer Engineering, Department of Computer Engineering, Outline Problem statement Assumptions

More information

Multiprocessors & Thread Level Parallelism

Multiprocessors & Thread Level Parallelism Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction

More information

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Bushra Ahsan and Mohamed Zahran Dept. of Electrical Engineering City University of New York ahsan bushra@yahoo.com mzahran@ccny.cuny.edu

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 11: Cache Coherence: Part II Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Bang Bang (My Baby Shot Me Down) Nancy Sinatra (Kill Bill Volume 1 Soundtrack) It

More information

DDM-CMP: Data-Driven Multithreading on a Chip Multiprocessor

DDM-CMP: Data-Driven Multithreading on a Chip Multiprocessor DDM-CMP: Data-Driven Multithreading on a Chip Multiprocessor Kyriakos Stavrou, Paraskevas Evripidou, and Pedro Trancoso Department of Computer Science, University of Cyprus, 75 Kallipoleos Ave., P.O.Box

More information

Module 14: "Directory-based Cache Coherence" Lecture 31: "Managing Directory Overhead" Directory-based Cache Coherence: Replacement of S blocks

Module 14: Directory-based Cache Coherence Lecture 31: Managing Directory Overhead Directory-based Cache Coherence: Replacement of S blocks Directory-based Cache Coherence: Replacement of S blocks Serialization VN deadlock Starvation Overflow schemes Sparse directory Remote access cache COMA Latency tolerance Page migration Queue lock in hardware

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang*, Josep Torrellas University of Illinois at Urbana-Champaign *Hewlett-Packard

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Lecture 11: Snooping Cache Coherence: Part II. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Lecture 11: Snooping Cache Coherence: Part II. CMU : Parallel Computer Architecture and Programming (Spring 2012) Lecture 11: Snooping Cache Coherence: Part II CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Announcements Assignment 2 due tonight 11:59 PM - Recall 3-late day policy Assignment

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1. Memory technology & Hierarchy Back to caching... Advances in Computer Architecture Andy D. Pimentel Caches in a multi-processor context Dealing with concurrent updates Multiprocessor architecture In

More information

Computer Architecture

Computer Architecture Computer Architecture Chapter 7 Parallel Processing 1 Parallelism Instruction-level parallelism (Ch.6) pipeline superscalar latency issues hazards Processor-level parallelism (Ch.7) array/vector of processors

More information

Flexible Use of Memory for Replication/Migration in Cache-Coherent DSM Multiprocessors

Flexible Use of Memory for Replication/Migration in Cache-Coherent DSM Multiprocessors Flexible Use of Memory for Replication/Migration in Cache-Coherent DSM Multiprocessors Vijayaraghavan Soundararajan 1, Mark Heinrich 1, Ben Verghese 2, Kourosh Gharachorloo 2, Anoop Gupta 1,3, and John

More information

Performance Evaluation and Cost Analysis of Cache Protocol Extensions for Shared-Memory Multiprocessors

Performance Evaluation and Cost Analysis of Cache Protocol Extensions for Shared-Memory Multiprocessors IEEE TRANSACTIONS ON COMPUTERS, VOL. 47, NO. 10, OCTOBER 1998 1041 Performance Evaluation and Cost Analysis of Cache Protocol Extensions for Shared-Memory Multiprocessors Fredrik Dahlgren, Member, IEEE

More information

ECE/CS 757: Homework 1

ECE/CS 757: Homework 1 ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Modelling Accesses to Migratory and Producer-Consumer Characterised Data in a Shared Memory Multiprocessor

Modelling Accesses to Migratory and Producer-Consumer Characterised Data in a Shared Memory Multiprocessor In Proceedings of the 6th IEEE Symposium on Parallel and Distributed Processing Dallas, October 26-28 1994, pp. 612-619. Modelling Accesses to Migratory and Producer-Consumer Characterised Data in a Shared

More information

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4. Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that

More information

An Efficient Lock Protocol for Home-based Lazy Release Consistency

An Efficient Lock Protocol for Home-based Lazy Release Consistency An Efficient Lock Protocol for Home-based Lazy ease Consistency Hee-Chul Yun Sang-Kwon Lee Joonwon Lee Seungryoul Maeng Computer Architecture Laboratory Korea Advanced Institute of Science and Technology

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Combined Performance Gains of Simple Cache Protocol Extensions

Combined Performance Gains of Simple Cache Protocol Extensions Combined Performance Gains of Simple Cache Protocol Extensions Fredrik Dahlgren, Michel Duboi~ and Per Stenstrom Department of Computer Engineering Lund University P.O. Box 118, S-221 00 LUND, Sweden *Department

More information

Three Tier Proximity Aware Cache Hierarchy for Multi-core Processors

Three Tier Proximity Aware Cache Hierarchy for Multi-core Processors Three Tier Proximity Aware Cache Hierarchy for Multi-core Processors Akshay Chander, Aravind Narayanan, Madhan R and A.P. Shanti Department of Computer Science & Engineering, College of Engineering Guindy,

More information

Minimizing the Directory Size for Large-scale DSM Multiprocessors. Technical Report

Minimizing the Directory Size for Large-scale DSM Multiprocessors. Technical Report Minimizing the Directory Size for Large-scale DSM Multiprocessors Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building 200 Union Street SE Minneapolis,

More information

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Nagi N. Mekhiel Department of Electrical and Computer Engineering Ryerson University, Toronto, Ontario M5B 2K3

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 11

More information

SMD149 - Operating Systems - Multiprocessing

SMD149 - Operating Systems - Multiprocessing SMD149 - Operating Systems - Multiprocessing Roland Parviainen December 1, 2005 1 / 55 Overview Introduction Multiprocessor systems Multiprocessor, operating system and memory organizations 2 / 55 Introduction

More information

Multiprocessor Cache Coherency. What is Cache Coherence?

Multiprocessor Cache Coherency. What is Cache Coherence? Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by

More information

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy Overview SMD149 - Operating Systems - Multiprocessing Roland Parviainen Multiprocessor systems Multiprocessor, operating system and memory organizations December 1, 2005 1/55 2/55 Multiprocessor system

More information

h Coherence Controllers

h Coherence Controllers High-Throughput h Coherence Controllers Anthony-Trung Nguyen Microprocessor Research Labs Intel Corporation 9/30/03 Motivations Coherence Controller (CC) throughput is bottleneck of scalable systems. CCs

More information

Scalable Cache Coherence. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Scalable Cache Coherence. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Scalable Cache Coherence Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hierarchical Cache Coherence Hierarchies in cache organization Multiple levels

More information

A Simulation: Improving Throughput and Reducing PCI Bus Traffic by. Caching Server Requests using a Network Processor with Memory

A Simulation: Improving Throughput and Reducing PCI Bus Traffic by. Caching Server Requests using a Network Processor with Memory Shawn Koch Mark Doughty ELEC 525 4/23/02 A Simulation: Improving Throughput and Reducing PCI Bus Traffic by Caching Server Requests using a Network Processor with Memory 1 Motivation and Concept The goal

More information

Multiple Context Processors. Motivation. Coping with Latency. Architectural and Implementation. Multiple-Context Processors.

Multiple Context Processors. Motivation. Coping with Latency. Architectural and Implementation. Multiple-Context Processors. Architectural and Implementation Tradeoffs for Multiple-Context Processors Coping with Latency Two-step approach to managing latency First, reduce latency coherent caches locality optimizations pipeline

More information

Cache Coherence in Scalable Machines

Cache Coherence in Scalable Machines ache oherence in Scalable Machines SE 661 arallel and Vector Architectures rof. Muhamed Mudawar omputer Engineering Department King Fahd University of etroleum and Minerals Generic Scalable Multiprocessor

More information

Evaluation of memory latency in cluster-based cachecoherent multiprocessor systems with di erent interconnection topologies

Evaluation of memory latency in cluster-based cachecoherent multiprocessor systems with di erent interconnection topologies Computers and Electrical Engineering 26 (2000) 207±220 www.elsevier.com/locate/compeleceng Evaluation of memory latency in cluster-based cachecoherent multiprocessor systems with di erent interconnection

More information

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho Why memory hierarchy? L1 cache design Sangyeun Cho Computer Science Department Memory hierarchy Memory hierarchy goals Smaller Faster More expensive per byte CPU Regs L1 cache L2 cache SRAM SRAM To provide

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information

Portland State University ECE 588/688. Cache-Only Memory Architectures

Portland State University ECE 588/688. Cache-Only Memory Architectures Portland State University ECE 588/688 Cache-Only Memory Architectures Copyright by Alaa Alameldeen and Haitham Akkary 2018 Non-Uniform Memory Access (NUMA) Architectures Physical address space is statically

More information

Exploiting Spatial Store Locality through Permission Caching in Software DSMs

Exploiting Spatial Store Locality through Permission Caching in Software DSMs Exploiting Spatial Store Locality through Permission Caching in Software DSMs Håkan Zeffer, Zoran Radović, Oskar Grenholm, and Erik Hagersten Uppsala University, Dept. of Information Technology, P.O. Box

More information

Effect of Data Prefetching on Chip MultiProcessor

Effect of Data Prefetching on Chip MultiProcessor THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS TECHNICAL REPORT OF IEICE. 819-0395 744 819-0395 744 E-mail: {fukumoto,mihara}@c.csce.kyushu-u.ac.jp, {inoue,murakami}@i.kyushu-u.ac.jp

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

A Multiprocessor system generally means that more than one instruction stream is being executed in parallel.

A Multiprocessor system generally means that more than one instruction stream is being executed in parallel. Multiprocessor Systems A Multiprocessor system generally means that more than one instruction stream is being executed in parallel. However, Flynn s SIMD machine classification, also called an array processor,

More information

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 05 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 5.1 Basic structure of a centralized shared-memory multiprocessor based on a multicore chip.

More information

Lecture 7: Implementing Cache Coherence. Topics: implementation details

Lecture 7: Implementing Cache Coherence. Topics: implementation details Lecture 7: Implementing Cache Coherence Topics: implementation details 1 Implementing Coherence Protocols Correctness and performance are not the only metrics Deadlock: a cycle of resource dependencies,

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence 1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations

More information

Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009

Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009 Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009 Agenda Introduction Memory Hierarchy Design CPU Speed vs.

More information

An Integrated Synchronization and Consistency Protocol for the Implementation of a High-Level Parallel Programming Language

An Integrated Synchronization and Consistency Protocol for the Implementation of a High-Level Parallel Programming Language An Integrated Synchronization and Consistency Protocol for the Implementation of a High-Level Parallel Programming Language Martin C. Rinard (martin@cs.ucsb.edu) Department of Computer Science University

More information

Directories vs. Snooping. in Chip-Multiprocessor

Directories vs. Snooping. in Chip-Multiprocessor Directories vs. Snooping in Chip-Multiprocessor Karlen Lie soph@cs.wisc.edu Saengrawee Pratoomtong pratoomt@cae.wisc.edu University of Wisconsin Madison Computer Sciences Department 1210 West Dayton Street

More information

Scalable Cache Coherence

Scalable Cache Coherence arallel Computing Scalable Cache Coherence Hwansoo Han Hierarchical Cache Coherence Hierarchies in cache organization Multiple levels of caches on a processor Large scale multiprocessors with hierarchy

More information

Cache Injection on Bus Based Multiprocessors

Cache Injection on Bus Based Multiprocessors Cache Injection on Bus Based Multiprocessors Aleksandar Milenkovic, Veljko Milutinovic School of Electrical Engineering, University of Belgrade E-mail: {emilenka,vm@etf.bg.ac.yu, Http: {galeb.etf.bg.ac.yu/~vm,

More information

2. Futile Stall HTM HTM HTM. Transactional Memory: TM [1] TM. HTM exponential backoff. magic waiting HTM. futile stall. Hardware Transactional Memory:

2. Futile Stall HTM HTM HTM. Transactional Memory: TM [1] TM. HTM exponential backoff. magic waiting HTM. futile stall. Hardware Transactional Memory: 1 1 1 1 1,a) 1 HTM 2 2 LogTM 72.2% 28.4% 1. Transactional Memory: TM [1] TM Hardware Transactional Memory: 1 Nagoya Institute of Technology, Nagoya, Aichi, 466-8555, Japan a) tsumura@nitech.ac.jp HTM HTM

More information

Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996

Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996 Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Review: Miss Rates for Snooping Protocol 4th C: Coherency Misses More processors:

More information

EE382 Processor Design. Processor Issues for MP

EE382 Processor Design. Processor Issues for MP EE382 Processor Design Winter 1998 Chapter 8 Lectures Multiprocessors, Part I EE 382 Processor Design Winter 98/99 Michael Flynn 1 Processor Issues for MP Initialization Interrupts Virtual Memory TLB Coherency

More information

ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate

More information

15-740/ Computer Architecture Lecture 12: Advanced Caching. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 12: Advanced Caching. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 12: Advanced Caching Prof. Onur Mutlu Carnegie Mellon University Announcements Chuck Thacker (Microsoft Research) Seminar Tomorrow RARE: Rethinking Architectural

More information

An Adaptive Sequential Prefetching Scheme in Shared-Memory Multiprocessors

An Adaptive Sequential Prefetching Scheme in Shared-Memory Multiprocessors An Adaptive Sequential Prefetching Scheme in Shared-Memory Multiprocessors Myoung Kwon Tcheun, Hyunsoo Yoon, Seung Ryoul Maeng Department of Computer Science, CAR Korea Advanced nstitute of Science and

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Performance of coherence protocols

Performance of coherence protocols Performance of coherence protocols Cache misses have traditionally been classified into four categories: Cold misses (or compulsory misses ) occur the first time that a block is referenced. Conflict misses

More information

Page 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence

Page 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence SMP Review Multiprocessors Today s topics: SMP cache coherence general cache coherence issues snooping protocols Improved interaction lots of questions warning I m going to wait for answers granted it

More information

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity Donghyuk Lee Carnegie Mellon University Problem: High DRAM Latency processor stalls: waiting for data main memory high latency Major bottleneck

More information

Scalability of the RAMpage Memory Hierarchy

Scalability of the RAMpage Memory Hierarchy Scalability of the RAMpage Memory Hierarchy Philip Machanick Department of Computer Science, University of the Witwatersrand, philip@cs.wits.ac.za Abstract The RAMpage hierarchy is an alternative to the

More information

Managing Off-Chip Bandwidth: A Case for Bandwidth-Friendly Replacement Policy

Managing Off-Chip Bandwidth: A Case for Bandwidth-Friendly Replacement Policy Managing Off-Chip Bandwidth: A Case for Bandwidth-Friendly Replacement Policy Bushra Ahsan Electrical Engineering Department City University of New York bahsan@gc.cuny.edu Mohamed Zahran Electrical Engineering

More information

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,

More information

Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors

Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors Maged M. Michael y, Ashwini K. Nanda z, Beng-Hong Lim z, and Michael L. Scott y y University of Rochester z IBM Research Department

More information

Lecture 12: Hardware/Software Trade-Offs. Topics: COMA, Software Virtual Memory

Lecture 12: Hardware/Software Trade-Offs. Topics: COMA, Software Virtual Memory Lecture 12: Hardware/Software Trade-Offs Topics: COMA, Software Virtual Memory 1 Capacity Limitations P P P P B1 C C B1 C C Mem Coherence Monitor Mem Coherence Monitor B2 In a Sequent NUMA-Q design above,

More information

Chap. 4 Multiprocessors and Thread-Level Parallelism

Chap. 4 Multiprocessors and Thread-Level Parallelism Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,

More information

Scientific Applications. Chao Sun
