Directories vs. Snooping in Chip-Multiprocessor

Karlen Lie and Saengrawee Pratoomtong
University of Wisconsin-Madison, Computer Sciences Department
1210 West Dayton Street, Madison, WI, USA

Abstract

Advances in semiconductor integration technology have resulted in ever-increasing transistor density. This has led to designs where multiple processors, each with its own caches, memory controller, network interface, and network router, are integrated on a single chip. Previously understood and deeply explored tradeoffs in conventional multiprocessor systems need to be reevaluated. We chose to investigate the coherence choice between directory-based and snooping-based protocols on such a chip multiprocessor (CMP). Our result was inconclusive, as we did not manage to get the snooping simulator fully working. However, from the numbers that we compiled, both protocols would perform equally well on a CMP, and the decision to choose one protocol over the other should be made after considering whether the CMP would be used by itself or connected to other CMPs.

1 Introduction

Our goal was to determine the implications of CMP on the choice of coherence protocol. With the advent of CMP, the multiprocessor characteristics that influenced this choice changed. Most notably, CMP provided high on-chip bandwidth as well as much reduced on-chip latency. These changes resulted in systems that had different limitations and advantages compared to conventional cache-coherent multiprocessors such as Symmetric Multiprocessor (SMP) and cache-coherent Non-Uniform Memory Access (cc-NUMA) systems. In terms of coherence protocol, snoopy-based coherence was prevalent on small- to medium-scale systems, with the largest such system being the Sun Starfire [1], capable of supporting up to 64 processors. Systems that intended to support more than 64 processors, such as the SGI Origin [2], adopted a directory-based coherence protocol. This division was dictated by the common goal of achieving the best memory performance possible, both in terms of maximum data transfer rate and lowest possible latency per memory operation. Generally, snoopy-based systems offered lower latency than directory-based systems, while directory-based systems were more scalable bandwidth-wise than snoopy-based systems. With multiple processors on a single chip, the division was not as clear as before.

Both available bandwidth and communication latency improved significantly. This led to the possible viability of system configurations that were previously considered impractical because of bandwidth limitations or high-latency overhead. In this project, we focused on comparing the two coherence protocols on similarly configured systems. We wanted to measure the effect of the bandwidth and latency improvements on overall execution time. We were also interested in the overall utilization of the available bandwidth, and in the latency of each memory instruction. To achieve this, we used the RSIM [3] multiprocessor simulator developed at Rice University. Since RSIM was originally a cc-NUMA system simulator, we had to incorporate the snoopy coherence protocol and remove the directory-coherence portion from our version of RSIM.

This paper is organized in the following manner. The next section discusses our methodology and setup. Section 3 presents our experimental results and discussion. Section 4 discusses related work, and finally, we present our conclusions in section 5.

2 Methodology

The CMP we were emulating had the following features. We had 8 processors on a single chip, each with its own L1 and L2 caches that were also on-chip. The directory states were also stored on-chip. For the directory-based protocol, the L2 caches of the eight processors and the directory were directly connected to a 2-D mesh interconnect. In the case of the snoopy coherence protocol, the directory and the L2 caches were connected to a single split-transaction bus. For all of our simulations, we forced the processors to comply with the Sequential Consistency (SC) model, and the coherence protocol we used was the 3-state MSI protocol. We also enforced inclusion between the L1 and the L2 caches. The rest of this section describes the simulator platform and the implemented modifications.
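For reference, the 3-state MSI protocol mentioned above tracks each cache line as Modified, Shared, or Invalid. The sketch below is a generic, highly simplified view of the processor-side transitions and is not taken from RSIM; the names (MsiState, BusAction, on_processor_access) are ours, and the read shared / read own request types are the ones discussed later in section 3.

```cpp
// Illustrative-only sketch of processor-side MSI transitions; the actual
// RSIM protocol engine handles many more cases (races, replacements, etc.).
enum class MsiState { Invalid, Shared, Modified };

struct BusAction {
    bool issue_read_shared = false;  // fetch a copy for reading
    bool issue_read_own = false;     // fetch/upgrade for writing, invalidating sharers
};

BusAction on_processor_access(MsiState& state, bool is_write) {
    BusAction act;
    switch (state) {
        case MsiState::Invalid:                               // miss: request the line
            if (is_write) { act.issue_read_own = true;    state = MsiState::Modified; }
            else          { act.issue_read_shared = true; state = MsiState::Shared;   }
            break;
        case MsiState::Shared:                                // upgrade on write
            if (is_write) { act.issue_read_own = true;    state = MsiState::Modified; }
            break;
        case MsiState::Modified:                              // hit: no bus traffic
            break;
    }
    return act;
}
```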

2.1 Baseline System

We used RSIM as our simulation platform. As RSIM was originally a cc-NUMA system with a directory-based coherence protocol, our baseline system was thus a CMP with a directory-based coherence protocol. RSIM featured one superscalar, dynamically scheduled uniprocessor per node, connected by a 2-D mesh network. Figure 1 shows the layout of our baseline simulation platform. The dotted line marks the extent of the chip, with all components within the boundary built on the same chip.

Figure 1. Baseline system architecture, derived from the RSIM developer manual [3].

To make unmodified RSIM emulate the behavior of a CMP, we adjusted various system parameters of the simulator. The main adjustments were an increase in the bandwidth and a reduction in the latency of the interconnects, and a reduction in the access latency of the directories and caches, all of which were assumed to be on-chip.

By being on-chip and built using SRAM instead of slower DRAM, the directory states had an access latency and a cycle time between accesses that were consequently much smaller than those of main memory, which was still off-chip and built from slower DRAM modules. Table 1 summarizes the parameter adjustments that we made to the unmodified version of RSIM to simulate our directory-based CMP.

Table 1. Parameters for the directory-based CMP.

Processor speed: 600 MHz
Issue width / instruction window size: 4 / 64
Number of processors: 8
Cache line size: 64 bytes
L1 cache size: 16 KB
L1 cache associativity: 1
L1 type: write-through (WT)
L1 access latency (both tag and data): 1.6 ns
L2 cache size: 64 KB
L2 cache associativity: 4
L2 type: write-back (WB)
L2 access latency (tag / data): 5 / 6.6 ns
Maximum bus bandwidth: 9.6 GB/s
Bus clock: 300 MHz
Memory access latency: 60 ns
Minimum directory access latency: 1.6 ns
Coherence protocol: MSI
Flit size (equivalent to the width of the network): 64 bytes
Flit delay (latency for each flit to pass through a network switch): 1.6 ns
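To put the latencies in Table 1 in terms of processor cycles, the short back-of-the-envelope conversion below can be used. It is not simulator code; it simply divides each nanosecond figure by the 600 MHz cycle time (about 1.67 ns), so an L1 access costs roughly one cycle, an L2 data access about four, and an off-chip memory access about 36.

```cpp
#include <cmath>
#include <cstdio>

// Convert the Table 1 latencies (ns) into 600 MHz processor cycles.
int main() {
    const double cycle_ns = 1e9 / 600e6;  // ~1.67 ns per cycle at 600 MHz
    const char* labels[]  = {"L1 access", "L2 tag", "L2 data",
                             "memory access", "directory access"};
    const double lat_ns[] = {1.6, 5.0, 6.6, 60.0, 1.6};
    for (int i = 0; i < 5; ++i)
        printf("%-17s %5.1f ns  ~= %2.0f cycles\n",
               labels[i], lat_ns[i], std::ceil(lat_ns[i] / cycle_ns));
    return 0;
}
```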

2.2 Directory on a Bus

Since our goal was to eventually have a snoopy-based multiprocessor CMP simulator to compare against the directory-based system described in section 2.1, we needed to either construct a brand-new snoopy-based simulator, choose another simulator to run the snoopy-based system on, or modify the existing RSIM simulator to support this protocol. We decided against the first two options for the following reasons:

- Writing a brand-new simulator was simply not feasible in the limited amount of time available for the project.
- Using a different simulator for the snoopy-based system would introduce too many uncontrolled variables and make the simulation results incomparable.
- Modifying the existing simulator should not pose too much difficulty.

We thus proceeded with the next logical step: modifying the baseline system. We modified the simulator so that all the processors shared the same bus. Since the bus would be our main and only communication medium, we disabled all network interfaces, updated the routing protocol to bypass them, and routed every network-bound transaction directly to its destination through the bus. All the parameters remained the same as in the baseline system. Figure 2 shows the modified system. As in figure 1, the dotted line marks the boundary that separates on-chip from off-chip components.

Figure 2. Directory on a bus.

2.3 System with Snoopy-Based Coherence Protocol

This system was the simulator that would serve as the comparison point to the baseline simulator specified in section 2.1, allowing a directory-based and a snoopy-based system to be compared. The following modifications were made to the directory-on-the-bus system described in section 2.2 to arrive at this system:

- Modified the caches so that they would snoop on the bus.
- Modified the order in which invalidations occurred. This was necessary because in the original RSIM, invalidations did not occur immediately. Instead, the system would first try to invalidate the L1, and then invalidate the L2. The existing delay for accessing these tags was unacceptable as well. Thus, we had to modify both caches to handle snooping messages simultaneously.
- Modified the bus to broadcast. We had not realized that the bus module in the original RSIM only supported point-to-point communication; as a result, we had to rewrite a significant chunk of the bus module to incorporate this capability.
- Modified the bus to limit the number of outstanding transactions on the bus to one transaction per cache line (a simplified sketch of these last two rules appears after this section).
- Finally, instead of removing the directory module, we used it to keep track of which cache lines were currently in the shared state. This way, we did not have to create a shared signal line that would need to be pulled whenever any cache held a line in the shared state in order to determine whether memory should reply to a particular transaction. The directory would now cease to respond to all transactions except those that would ordinarily be serviced by the memory unit in a snoopy-based system.

After applying all the above modifications, we essentially had a snooping-based system with separate processor-side cache tags and bus-side cache tags. The split-transaction bus had a variable response delay, and responses could come back out of order.
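The broadcast and one-outstanding-transaction-per-line rules described above can be illustrated with the following sketch. This is not the RSIM bus module; the names (SnoopBus, Snooper, try_issue) and the structure are our own, and the real split-transaction bus tracks much more state (arbitration, variable response delays, retries).

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Stand-in for a cache's bus-side tags: each cache observes every request.
struct Snooper {
    virtual void snoop(uint64_t line_addr, bool is_read_own) = 0;
    virtual ~Snooper() = default;
};

class SnoopBus {
public:
    explicit SnoopBus(std::vector<Snooper*> snoopers)
        : snoopers_(std::move(snoopers)) {}

    // Issue a request: fails if a transaction for this line is already
    // outstanding (the requester must retry), otherwise broadcasts to all.
    bool try_issue(uint64_t line_addr, bool is_read_own) {
        if (outstanding_.count(line_addr)) return false;
        outstanding_.insert(line_addr);
        for (Snooper* s : snoopers_)        // broadcast, not point-to-point
            s->snoop(line_addr, is_read_own);
        return true;
    }

    // The (possibly out-of-order) response for the line has arrived.
    void complete(uint64_t line_addr) { outstanding_.erase(line_addr); }

private:
    std::vector<Snooper*> snoopers_;
    std::unordered_set<uint64_t> outstanding_;  // lines with a pending transaction
};
```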

Up until the time this paper was written, we still could not get the snoopy-based system completely functioning. Even after we had resolved all the race conditions that existed in the system, we still had issues with our modifications being incompatible with the original RSIM version.

3 Results and Discussion

The following section describes the results we collected from our available simulators. All results were from running three applications from the Splash-2 benchmark suite, namely fft, radix, and mp3d.

3.1 Simulation results for the directory-on-the-bus system

Both the baseline and the directory-on-the-bus system showed similar behavior in terms of miss/hit rate and number of references for each benchmark. Thus, the simulation results presented here are based on comparisons of directory latency, bus bandwidth, and bus utilization.

Figure 3. Directory latency ratio (directory-on-the-bus / baseline) per node for fft, radix, and mp3d.

Figure 3 shows the ratio of directory latency between the directory-on-the-bus and the baseline system for all 8 processors. Directory latency (number of cycles needed per request) was derived by counting the total number of cycles elapsed from the time a coherence request was sent out to the time when all the coherence replies for that request arrived back at the directory.

Figure 4. Speculated reduction in bus bandwidth utilization, broken down by coherence packet type (copyback-invalidate, copyback, and invalidate) for each benchmark.

Figure 4 shows the speculated reduction in bandwidth utilization when the system architecture is changed from directory-on-the-bus to complete snooping. The values in the table below the figure indicate the actual percentage of bandwidth reduction caused by each type of coherence packet, and the bar graph shows how each type of coherence packet contributes to the overall reduction. We analyzed the total number of coherence packets that would not actually be sent when using the snoopy coherence protocol and calculated the reduction in bandwidth usage by multiplying that count by the coherence packet size. In a directory-based system, the directory had to generate an invalidation packet to each sharer once it received a read own request, while in a snooping-based system a read own request itself also served to invalidate all sharers.

Thus, the bandwidth consumed by all invalidation and invalidation-acknowledgement packets in our directory-on-the-bus system would not be present in a snoop-based system. Copyback and copyback-invalidate packets were sent from the directory to the owner cache in response to read share and read own requests. The directory-on-the-bus and the baseline simulator provided cache-to-cache transfer capability for the MSI coherence protocol, so that if the cache holding the block in M state received a copyback or copyback-invalidate request, it would invoke the cache-to-cache transfer mechanism. In a snoopy-based system, these requests would not be needed; read own and read shared requests would serve the same purpose. In addition, the separate messages that would originally go to the directory (the acknowledgement for a copyback invalidate and the data transfer for a copyback request) would also not be present. Thus, we could calculate the total bandwidth consumed by the two requests and their acknowledgements and subtract it from our total utilized bandwidth. For radix and mp3d, the major factor in the reduction of bandwidth usage came from the removal of copyback packets, while for fft it came from the removal of invalidation packets.

Figure 5. Percentage of bus utilization, split into contention/arbitration delay and packet transfer, for each benchmark on the directory-on-the-bus (DOB) and baseline systems.
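The estimate behind Figure 4 amounts to the simple calculation sketched below: the counts of coherence packets that a snooping protocol would never send are multiplied by the packet size and divided by the run time. The counts would come from simulator statistics; the structure, names, and numbers used here are illustrative only.

```cpp
#include <cstdint>
#include <cstdio>

// Packets a snooping protocol would not send, taken from simulator statistics.
struct CoherenceCounts {
    uint64_t invalidations;
    uint64_t invalidation_acks;
    uint64_t copybacks_and_replies;
    uint64_t copyback_invls_and_acks;
};

// Bandwidth saved (GB/s) = eliminated packets * packet size / run time.
double bandwidth_saved_gbs(const CoherenceCounts& c, uint64_t packet_bytes,
                           double run_time_s) {
    const uint64_t packets = c.invalidations + c.invalidation_acks +
                             c.copybacks_and_replies + c.copyback_invls_and_acks;
    return packets * packet_bytes / run_time_s / 1e9;
}

int main() {
    CoherenceCounts c{120000, 120000, 80000, 30000};   // hypothetical counts
    printf("saved %.2f GB/s\n", bandwidth_saved_gbs(c, 64, 0.01));
    return 0;
}
```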

Figure 5 shows the percentage of bus utilization of the directory-on-the-bus system compared to the average bus utilization among the 8 buses of the baseline system. Bus utilization was derived from two parts: utilization due to contention and arbitration delay, and utilization due to the actual packet transfer. In the figure, <Application_name>_DOB indicates that the application was simulated on the directory-on-the-bus architecture, while <Application_name>_Base indicates that the application was simulated on the baseline architecture. In a snoop-based system, we expected to see a reduction in bus utilization and contention because of the reduced number of coherence packets sent out for a given memory operation.

From the results, the directory-on-the-bus system had higher directory latency, percentage of bus utilization, and contention on the bus than the baseline system. The higher directory latency was caused by contention for the single bus, and by the fact that it was not possible for the directory to issue multiple coherence packets to different nodes at the same time.

One very curious result that we obtained was the total execution time for the benchmarks. When we compared the total execution time between the baseline and the directory-on-the-bus system, we had a surprising result. For all benchmarks, the directory-on-the-bus system actually took slightly less time, in terms of total number of simulation cycles, to complete the benchmarks. This was unexpected, as the directory latency numbers leaned in favor of the baseline system performing better than the modified system. Our guess was that even though directory latency was better in the baseline system, the time taken for a request from the cache to reach the directory was actually better in the modified system by virtue of having fewer hops. Thus, with applications that accessed remotely stored data frequently, the directory-on-the-bus system would perform better than the baseline simulator.
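As a small clarification of the metric in Figure 5, the utilization percentage simply combines the two components named above. A minimal sketch, assuming the simulator reports per-bus cycle counts (the function name and parameters are ours):

```cpp
#include <cstdint>

// Bus utilization as used in Figure 5: the fraction of simulated cycles the
// bus spent either contended/arbitrating or transferring packets.
double bus_utilization_pct(uint64_t contention_and_arbitration_cycles,
                           uint64_t packet_transfer_cycles,
                           uint64_t total_cycles) {
    return 100.0 * (contention_and_arbitration_cycles + packet_transfer_cycles)
                 / static_cast<double>(total_cycles);
}
```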

3.2 Discussion

The system used in our project had a maximum on-chip bandwidth of 9.6 GB/s. Among all the benchmarks that we ran, mp3d consumed the most bandwidth, at 4.6 GB/s in the directory-on-the-bus system. The bandwidth usage of a snoopy-based system could be expected to be lower than that, as portrayed in figures 4 and 5. For our benchmarks, we could see that communication bandwidth between the different processors was not a limiting factor. The main bottleneck would still be the off-chip bandwidth to main memory, and thus, when it came down to the choice of coherence protocol, both directory and snooping would perform equally well.

We were not able to measure the latency of memory accesses in the snoopy-based system. However, as snooping required fewer hops from the time a request was sent out to the time the reply was received, memory operations in a snooping-based system should incur lower latency than those of a similarly configured directory-based system.

However, it should be noted that a directory-based CMP system had several advantages. If most memory operations were to go to main memory, then access latency in a directory-based system was not much worse than in a snooping-based system, since the access time to off-chip memory modules would dominate. Another advantage would be the ease of integrating multiple CMPs to form bigger multiprocessors. With directory as the main coherence protocol, as more CMP modules got connected, no extra complexity would be needed to support cache coherence; the existing mechanism would suffice.

4 Related Work

There are currently several proposed CMP systems. Piranha [4], the brainchild of Compaq Research, was optimized for commercial workloads. Piranha provided a crossbar with a 32 GB/s maximum bandwidth to connect the eight simple, in-order uniprocessors that resided on the same die. It used a directory-based coherence protocol, and it utilized one big shared L2 cache which did not maintain inclusion with its L1 counterparts. Compared to Piranha, our system had lower on-chip bandwidth, and each processor still had its own L1 and L2 caches. Another proposed system was the IBM Power4. Power4 [5] was also targeted at the high-reliability server market. Each multichip module (MCM) had 2 uniprocessors with their own L1 caches, a shared L2 cache, and the tags for an off-chip L3 cache. The on-chip bandwidth between the superscalar cores and the L2 cache was more than 100 GB/s. Finally, from the academic projects, there was the Stanford Hydra CMP [6]. The Hydra had 4 processors on a single chip and a write-through bus that connected the L1 caches with the L2 cache, the main memory interface, and the I/O bus interface.

5 Conclusions and Future Work

At the outset of the project, we wanted to compare the effect of CMP on the choice of directory- vs. snooping-based protocols. We did not manage to achieve that goal, as our snooping simulator did not quite work. However, we did obtain some measure of the expected bandwidth usage for the snooping system from our directory data. From there, we could conclude that even though a snooping-based protocol on a CMP would save a lot of bandwidth, bandwidth was actually not an issue. With the high availability of on-chip bandwidth, intra-chip communication would not be a bottleneck. Instead, more consideration should be given to how the CMP would fit into the larger picture, i.e., which protocol would be feasible if we were to connect more than one CMP together.

An extension to this project would be to look at how multiple connected CMPs would perform under different coherence protocols. As CMPs were designed for high-reliability server markets, we would probably want more than just 2, 4, or 8 processors to run the server. Thus, the performance of the system could be greatly affected by how the CMPs communicated with each other.

References

[1] A. Charlesworth. Starfire: Extending the SMP Envelope. IEEE Micro, Vol. 18, No. 1, pages 39-49, Jan/Feb 1998.
[2] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of the 24th Annual International Symposium on Computer Architecture, June 1997.
[3] V. S. Pai, P. Ranganathan, and S. V. Adve. RSIM Reference Manual, Version 1.0. Rice University.
[4] L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.
[5] K. Diefendorff. Power4 Focuses on Memory Bandwidth. Microprocessor Report, Vol. 13, No. 13, Oct 1999.
[6] L. Hammond, B. Hubbert, M. Siu, M. Prabhu, M. Chen, and K. Olukotun. The Stanford Hydra CMP. IEEE Micro, Mar/Apr 1999.
