Cache Coherence Protocols for Chip Multiprocessors - I
|
|
- Elijah Jordan
- 6 years ago
- Views:
Transcription
1 Cache Coherence Protocols for Chip Multiprocessors - I John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 5 6 September 2016
2 Context Thus far chip multiprocessors hardware threading strategies simultaneous multithreading fine-grain multithreading future microprocessor issues and trends Today: sharing cache in chip multiprocessors 2
3 Context Thus far chip multiprocessors hardware threading strategies simultaneous multithreading fine-grain multithreading future microprocessor issues and trends Today: sharing cache in chip multiprocessors cache coherence victim replication 3
4 Today s References Chapter 6: Coherence Protocols; Chapter 7 Snooping Coherence Protocols; Chapter 8: Directory Coherence Protocols. A Primer on Memory Consistency and Cache Coherence. Daniel J. Sorin, Mark D. Hill, David A. Wood Synthesis Lectures on Computer Architecture. Morgan Claypool Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. In Proceedings 32nd International Symposium on Computer Architecture, Madison, WI, June
5 A Primer on Caching and Coherence 5
6 Cache A content-addressable memory used to store data items so that future requests will be served faster reduces average latencies to access storage How does a datum become stored in a cache? value written by an earlier computation duplicate of a value available from storage elsewhere Results of load/store operations cache hit requested data is present in cache cache miss requested data is not present in cache 6
7 Consistency vs. Coherence Consistency models (aka memory models) define correct shared memory behavior in terms of loads and stores (memory reads and writes), without reference to caches or coherence can stores be seen out of order? if so, under what conditions? sequential consistency vs. weak memory models Coherence problems can arise if multiple actors (e.g., multiple cores) have access to multiple copies of a datum (e.g., in multiple caches) and at least one access is a write must appear to be one and only one value per memory location access to stale data (incoherence) is prevented using a coherence protocol set of rules implemented by the distributed actors within a system 7
8 Goal of Coherence Protocols Maintain coherence by enforcing the following invariants Single-Writer, Multiple-Reader (SWMR) Invariant for any memory location A, at any given time, there exists only a single core that may write to A (that core can also read it) or some number of cores that may only read A Data-Value Invariant the value of the memory location at the start of an epoch is the same as its value at the end of its last read-write epoch 8
9 Implementing Coherence Invariants Hardware: typical of systems today each cache and the LLC/memory has an associated a finite state machine known as a coherence controller set of controllers form a distributed system controllers exchange messages to ensure that, for each block, the SWMR and data value invariants are maintained at all times Software relies on compiler and/or runtime support may or may not have help from the hardware must be conservative to be safe assume the worst about potential memory aliases of increasing interest concerns about cost of coherence in joules scales well for microprocessors based on tiled designs Intel Scalable Cloud Computer (SCC),
10 Cache Controller Cache controller accepts loads and stores from the core and returns load values to the core On a cache miss, a controller initiates a coherence transaction by issuing a coherence request for the block containing the location accessed by the core Cache controller listens for and responds to coherence requests from other caches Implements a set of finite state machines logically per block and receives and processes events (e.g., incoming coherence messages) depending upon the block s state Figure Credit: Daniel J. Sorin, Mark D. Hill, David A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture. Morgan Claypool
11 State Diagram for 4-state Invalidate Protocol MESI states: Modified, Exclusive, Shared, and Invalid Permissible state pairs for a pair of caches M M E S I E S I MESI figure credit: (Copyright Michael Thomadakis, Texas A&M ) 11
12 Memory Controller Memory controller similar to a cache controller listens for and responds to coherence requests from caches Only a network side does not issue coherence requests (on behalf of loads or stores) receive coherence responses Figure Credit: Daniel J. Sorin, Mark D. Hill, David A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture. Morgan Claypool
13 Snooping Coherence Snoopy cache systems broadcast all invalidates and read requests all coherence controllers listen and perform appropriate coherence operations locally Figure Credit: Daniel J. Sorin, Mark D. Hill, David A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture. Morgan Claypool
14 Operation of Snoopy Caches Once a datum is tagged modified or exclusive all subsequent operations can be performed locally in cache no external traffic needed If a data item is read by a number of processors transitions to the shared state in all caches all subsequent read operations become local If multiple processors read and update data generate coherence requests on the bus bus is bandwidth limited: imposes a limit on updates per second 14
15 Directory-based Coherence Snooping protocol: a cache controller initiates a request for a block by broadcasting a request message to all other coherence controllers A directory maintains a global view of each block tracks which caches hold each block and in what states Directory protocol: a cache controller initiates a request for a block by sending it to the memory controller that is the home for that block Figure Credit: Daniel J. Sorin, Mark D. Hill, David A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture. Morgan Claypool
16 Intel MESIF Protocol (2005) MESIF: Modified, Exclusive, Shared, Invalid and Forward If a cache line is shared one shared copy of the cache line is in the F state remaining copies of the cache line are in the S state Forward (F) state designates a single copy of data from which further copies can be made cache line in the F state will respond to a request for a copy of the cache line consider how one embodiment of the protocol responds to a read newly created copy is placed in the F state cache line previously in the F state is put in the S or the I state H. Hum et al. US Patent 6,922,756. July
17 Dance-Hall Shared Cache CMPs Niagara1 L1 cache co-located with PE PEs on far side of interconnect from L2 cache each L2 cache equidistant from all cores Figure credit: Niagara: A 32-Way Multithreaded SPARC Processor, P. Kongetira, K. Aingaran, and K. Olukotun, IEEE Micro, pp , March-April
18 Blue Gene/Q s BGC Chip (2012) System on a chip processor, memory, network logic 360mm 2, 1.47B transistors 16 user + 1 service cores + 1 spare core all cores are symmetric 4-way SMT per core Shared L2 cache: 32MB edram multi-versioned cache transactional memory, speculative execution, atomic operations latency ~80 cycles Dual memory controller 16GB external DDR3 memory 1.3GB/s 2 x 16 byte wide interface (+ECC) Chip-to-chip networking integrated router for 5D torus Figure and information credit: Blue Gene/ Q compute chip. Ruud Haring. Hot Chips 23. August
19 Emerging Tiled Architectures Trends more processor cores larger cache sizes deeper cache hierarchies Implications wire delay of tens of clock cycles across chip worst case latency: likely unacceptable hit times Tiled chip multiprocessors approach co-locate part of shared cache near each core reduce access latency to (at least some) shared data 19
20 Tiled Chip Multiprocessors Advantages Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June Simpler replicated physical design readily scale to larger processor counts Can support product families with different number of tiles 20
21 Alternatives for Managing Tiled L2 in CMPs Treat each slice as a private L2 cache per tile (L2P) Manage all slices as a single large shared L2 cache (L2S) Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June
22 Implications of L2 Caching Strategy Manage each slice as a private L2 cache per tile use directory approach to keep caches coherent tags duplicated and distributed across tiles by set index + delivers lowest hit latency works well when your working set fits in your private L2 reduces total effective cache capacity each tile has a local copy of each line it touches can t borrow L2 space from other PEs with less full caches Manage all slices as a single large shared L2 cache focus: NUCA (non-uniform cache architecture) designs differs from dancehall design in Niagara and Blue Gene/Q + shared L2 increases effective cache capacity for shared data incur long hit latencies when L2 data is on remote tile migration-based NUCA protocols seem problematic 22
23 Victim Replication Combines advantages of private and shared L2$ schemes Variant of shared scheme Attempts to keep copies of local L1$ victims in local L2$ retained victim is a replica of one in an L2 on remote home tile 23
24 Victim Replication in Action Dynamically build a small victim cache in L2 Processor misses in shared L2 bring line from memory place in L2 in a home tile determined by subset of address bits also bring into L1 of requester Incoming invalidation to a processor follow usual L2S protocol (check local L1 and L2) If L1 line is evicted on conflict or capacity miss attempt to copy victim line into local L2 Primary cache misses must check for local replica on miss no replica: forward request to home tile on replica hit: invalidate replica in local L2, move to local L1 24
25 Victim Replacement Policy Never evict global shared line in favor of local replica L2VR replaces lines in following priority order invalid line global line with no sharers existing replica If no lines belong to these categories no replica is made in local L2 cache victim evicted from the tile as in L2S More than one candidate line? pick at random 25
26 Advantages of Victim Replication Hits to replicated copies reduce effective latency of shared L2 Higher effective capacity for shared data than private L2 26
27 Victim Replication Evaluation Parameters 8-way CMP: 4x2 grid Associativity is 2x #PE Problematic for large tiled CMP? Table credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June
28 VR Single-threaded Benchmarks Table credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June
29 Single Threaded Access Latencies L2VR adapts to provide 3-level hierarchy: L1, local L2, remote L2 L2S latency is higher than competitors for singlethreaded programs L2VR latency is close to L2P Lower is better! Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June
30 Single Threaded Off-chip Miss Rate Lower miss rates than L2P Slightly higher than L2S Lower is better! Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June
31 Single Threaded On-chip Coherence Traffic 71% fewer coherence msg hops using L2VR than L2S L2VR comparable to L2P Lower is better! Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June
32 VR Multithreaded Benchmarks Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June
33 Multi-Threaded Average Access Latency L2 slice = 1MB CG almost fits in private L2 cache; low latency of L2P helps high (9%) L1 miss rate IS fit in L1 BT, FT, LU, SP, apache fit in L2 slice MG, EP, checkers better with L2VR Lower is better! Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June
34 Multi-Threaded Off-chip Miss Rates CG almost fits in L2 cache; L1 miss latency of L2P dominates cost of off-chip traffic MG and EP improve with L2VR: they have fewer offchip misses than L2P dbench: high miss rates regardless Lower is better! Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June
35 Multi-Threaded On-chip Coherence Traffic L2VR has less coherence traffic than L2S Lower is better! Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June
36 MT: Avg % L2$ as Replica Over Time L2VR is adaptive: differs across applications; differs over time Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June
37 MT Memory Access Breakdown (L2P, L2S, L2VR) Ideally: low # of misses, most hits in local L2 Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June
38 Victim Replication Summary Distributed shared L2 caches decrease off-chip traffic vs. private caches at the expense of latency Victim replication reduces on-chip latency by replicating cache lines within same level of cache near threads that are actively accessing the line Result: dynamically self-tuning hybrid between private and shared caches Multithreaded benchmark results summary in most cases, L2VR creates enough replicas so that performance is usually within 5% of L2P L2VR reduces memory latency by avg. of 16% compared to L2S CG is the only case where L2P significantly outperforms both L2S and L2VR (almost fits in private L2) 38
39 Additional References Victim Caching Jouppi, N. P Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. SIGARCH Comput. Archit. News 18, 3a (Jun. 1990), DOI= Shared Caches in Multicores: The Good, The Bad, and The Ugly. Mary Jane Irwin. Athena Award Lecture. International Symposium on Computer Architecture, Saint-Malo, France, June
Module 9: Addendum to Module 6: Shared Memory Multiprocessors Lecture 17: Multiprocessor Organizations and Cache Coherence. The Lecture Contains:
The Lecture Contains: Shared Memory Multiprocessors Shared Cache Private Cache/Dancehall Distributed Shared Memory Shared vs. Private in CMPs Cache Coherence Cache Coherence: Example What Went Wrong? Implementations
More informationModule 9: "Introduction to Shared Memory Multiprocessors" Lecture 16: "Multiprocessor Organizations and Cache Coherence" Shared Memory Multiprocessors
Shared Memory Multiprocessors Shared memory multiprocessors Shared cache Private cache/dancehall Distributed shared memory Shared vs. private in CMPs Cache coherence Cache coherence: Example What went
More informationShared Symmetric Memory Systems
Shared Symmetric Memory Systems Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University
More information4 Chip Multiprocessors (I) Chip Multiprocessors (ACS MPhil) Robert Mullins
4 Chip Multiprocessors (I) Robert Mullins Overview Coherent memory systems Introduction to cache coherency protocols Advanced cache coherency protocols, memory systems and synchronization covered in the
More informationMultiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types
Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon
More informationCache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)
Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationOptimizing Replication, Communication, and Capacity Allocation in CMPs
Optimizing Replication, Communication, and Capacity Allocation in CMPs Zeshan Chishti, Michael D Powell, and T. N. Vijaykumar School of ECE Purdue University Motivation CMP becoming increasingly important
More informationLecture: Cache Hierarchies. Topics: cache innovations (Sections B.1-B.3, 2.1)
Lecture: Cache Hierarchies Topics: cache innovations (Sections B.1-B.3, 2.1) 1 Types of Cache Misses Compulsory misses: happens the first time a memory word is accessed the misses for an infinite cache
More informationShared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network
Shared Memory Multis Processor Processor Processor i Processor n Symmetric Shared Memory Architecture (SMP) cache cache cache cache Interconnection Network Main Memory I/O System Cache Coherence Cache
More informationLecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604
More informationAgenda. System Performance Scaling of IBM POWER6 TM Based Servers
System Performance Scaling of IBM POWER6 TM Based Servers Jeff Stuecheli Hot Chips 19 August 2007 Agenda Historical background POWER6 TM chip components Interconnect topology Cache Coherence strategies
More informationOverview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware
Overview: Shared Memory Hardware Shared Address Space Systems overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and
More informationOverview: Shared Memory Hardware
Overview: Shared Memory Hardware overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and update protocols false sharing
More informationLecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)
Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 More Cache Basics caches are split as instruction and data; L2 and L3 are unified The /L2 hierarchy can be inclusive,
More informationESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols CA SMP and cache coherence
Computer Architecture ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols 1 Shared Memory Multiprocessor Memory Bus P 1 Snoopy Cache Physical Memory P 2 Snoopy
More informationLecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015
Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working
More informationMultiprocessor Cache Coherency. What is Cache Coherence?
Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by
More informationMultiprocessors & Thread Level Parallelism
Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationCOSC 6385 Computer Architecture - Thread Level Parallelism (I)
COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month
More informationFoundations of Computer Systems
18-600 Foundations of Computer Systems Lecture 21: Multicore Cache Coherence John P. Shen & Zhiyi Yu November 14, 2016 Prevalence of multicore processors: 2006: 75% for desktops, 85% for servers 2007:
More informationMultiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism
Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationModule 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 10: Introduction to Coherence. The Lecture Contains:
The Lecture Contains: Four Organizations Hierarchical Design Cache Coherence Example What Went Wrong? Definitions Ordering Memory op Bus-based SMP s file:///d /...audhary,%20dr.%20sanjeev%20k%20aggrwal%20&%20dr.%20rajat%20moona/multi-core_architecture/lecture10/10_1.htm[6/14/2012
More informationCOEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence
1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations
More informationComputer Architecture
Jens Teubner Computer Architecture Summer 2016 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2016 Jens Teubner Computer Architecture Summer 2016 83 Part III Multi-Core
More informationCS252 Spring 2017 Graduate Computer Architecture. Lecture 12: Cache Coherence
CS252 Spring 2017 Graduate Computer Architecture Lecture 12: Cache Coherence Lisa Wu, Krste Asanovic http://inst.eecs.berkeley.edu/~cs252/sp17 WU UCB CS252 SP17 Last Time in Lecture 11 Memory Systems DRAM
More informationBandwidth Adaptive Snooping
Two classes of multiprocessors Bandwidth Adaptive Snooping Milo M.K. Martin, Daniel J. Sorin Mark D. Hill, and David A. Wood Wisconsin Multifacet Project Computer Sciences Department University of Wisconsin
More information12 Cache-Organization 1
12 Cache-Organization 1 Caches Memory, 64M, 500 cycles L1 cache 64K, 1 cycles 1-5% misses L2 cache 4M, 10 cycles 10-20% misses L3 cache 16M, 20 cycles Memory, 256MB, 500 cycles 2 Improving Miss Penalty
More informationLecture 7: Implementing Cache Coherence. Topics: implementation details
Lecture 7: Implementing Cache Coherence Topics: implementation details 1 Implementing Coherence Protocols Correctness and performance are not the only metrics Deadlock: a cycle of resource dependencies,
More informationLecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)
Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 Techniques to Reduce Cache Misses Victim caches Better replacement policies pseudo-lru, NRU Prefetching, cache
More informationLecture 11: Large Cache Design
Lecture 11: Large Cache Design Topics: large cache basics and An Adaptive, Non-Uniform Cache Structure for Wire-Dominated On-Chip Caches, Kim et al., ASPLOS 02 Distance Associativity for High-Performance
More informationThe Cache Write Problem
Cache Coherency A multiprocessor and a multicomputer each comprise a number of independent processors connected by a communications medium, either a bus or more advanced switching system, such as a crossbar
More informationVictim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors
Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors Michael Zhang and Krste Asanović MIT Computer Science and Artificial Intelligence Laboratory The Stata Center,
More informationKaisen Lin and Michael Conley
Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC
More informationPortland State University ECE 588/688. Cache Coherence Protocols
Portland State University ECE 588/688 Cache Coherence Protocols Copyright by Alaa Alameldeen 2018 Conditions for Cache Coherence Program Order. A read by processor P to location A that follows a write
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1)
1 MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1) Chapter 5 Appendix F Appendix I OUTLINE Introduction (5.1) Multiprocessor Architecture Challenges in Parallel Processing Centralized Shared Memory
More informationChapter 9 Multiprocessors
ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University
More informationINSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing
UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 11
More informationA Study of Cache Organizations for Chip- Multiprocessors
A Study of Cache Organizations for Chip- Multiprocessors Shatrugna Sadhu, Hebatallah Saadeldeen {ssadhu,heba}@cs.ucsb.edu Department of Computer Science University of California, Santa Barbara Abstract
More informationCS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 1 (Chapter 5)
CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 1 (Chapter 5) Copyright 2001 Mark D. Hill University of Wisconsin-Madison Slides are derived
More informationIntroduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano Outline The problem of cache coherence Snooping protocols Directory-based protocols Prof. Cristina Silvano, Politecnico
More informationCMSC 611: Advanced. Distributed & Shared Memory
CMSC 611: Advanced Computer Architecture Distributed & Shared Memory Centralized Shared Memory MIMD Processors share a single centralized memory through a bus interconnect Feasible for small processor
More informationCache Coherence. (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri
Cache Coherence (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri mainakc@cse.iitk.ac.in 1 Setting Agenda Software: shared address space Hardware: shared memory multiprocessors Cache
More information10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems
1 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems To enhance system performance and, in some cases, to increase
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationMultiprocessor Synchronization
Multiprocessor Systems Memory Consistency In addition, read Doeppner, 5.1 and 5.2 (Much material in this section has been freely borrowed from Gernot Heiser at UNSW and from Kevin Elphinstone) MP Memory
More informationLecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University
Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:
More informationRethinking Last-Level Cache Management for Multicores Operating at Near-Threshold
Rethinking Last-Level Cache Management for Multicores Operating at Near-Threshold Farrukh Hijaz, Omer Khan University of Connecticut Power Efficiency Performance/Watt Multicores enable efficiency Power-performance
More informationLecture 24: Multiprocessing Computer Architecture and Systems Programming ( )
Systems Group Department of Computer Science ETH Zürich Lecture 24: Multiprocessing Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Most of the rest of this
More informationChapter Seven. Idea: create powerful computers by connecting many smaller ones
Chapter Seven Multiprocessors Idea: create powerful computers by connecting many smaller ones good news: works for timesharing (better than supercomputer) vector processing may be coming back bad news:
More informationLecture 1: Introduction
Lecture 1: Introduction ourse organization: 4 lectures on cache coherence and consistency 2 lectures on transactional memory 2 lectures on interconnection networks 4 lectures on caches 4 lectures on memory
More informationEECS 570 Final Exam - SOLUTIONS Winter 2015
EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32
More informationCSC 631: High-Performance Computer Architecture
CSC 631: High-Performance Computer Architecture Spring 2017 Lecture 10: Memory Part II CSC 631: High-Performance Computer Architecture 1 Two predictable properties of memory references: Temporal Locality:
More informationCS4961 Parallel Programming. Lecture 4: Memory Systems and Interconnects 9/1/11. Administrative. Mary Hall September 1, Homework 2, cont.
CS4961 Parallel Programming Lecture 4: Memory Systems and Interconnects Administrative Nikhil office hours: - Monday, 2-3PM - Lab hours on Tuesday afternoons during programming assignments First homework
More informationImproving Virtual Machine Scheduling in NUMA Multicore Systems
Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationLecture 11: Snooping Cache Coherence: Part II. CMU : Parallel Computer Architecture and Programming (Spring 2012)
Lecture 11: Snooping Cache Coherence: Part II CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Announcements Assignment 2 due tonight 11:59 PM - Recall 3-late day policy Assignment
More informationCOSC4201 Multiprocessors
COSC4201 Multiprocessors Prof. Mokhtar Aboelaze Parts of these slides are taken from Notes by Prof. David Patterson (UCB) Multiprocessing We are dedicating all of our future product development to multicore
More informationCMSC 611: Advanced Computer Architecture
CMSC 611: Advanced Computer Architecture Shared Memory Most slides adapted from David Patterson. Some from Mohomed Younis Interconnection Networks Massively processor networks (MPP) Thousands of nodes
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 3: The Memory System You Can t Ignore it!
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 3: The Memory System You Can t Ignore it! Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Memory Computer Technology
More informationParallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence
Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture
More informationSpeculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution
Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding
More informationFlynn s Classification
Flynn s Classification SISD (Single Instruction Single Data) Uniprocessors MISD (Multiple Instruction Single Data) No machine is built yet for this type SIMD (Single Instruction Multiple Data) Examples:
More informationComputer Science 146. Computer Architecture
Computer Architecture Spring 24 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 2: More Multiprocessors Computation Taxonomy SISD SIMD MISD MIMD ILP Vectors, MM-ISAs Shared Memory
More informationGLocks: Efficient Support for Highly- Contended Locks in Many-Core CMPs
GLocks: Efficient Support for Highly- Contended Locks in Many-Core CMPs Authors: Jos e L. Abell an, Juan Fern andez and Manuel E. Acacio Presenter: Guoliang Liu Outline Introduction Motivation Background
More informationSimultaneous Multithreading and the Case for Chip Multiprocessing
Simultaneous Multithreading and the Case for Chip Multiprocessing John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 2 10 January 2019 Microprocessor Architecture
More informationECE 485/585 Microprocessor System Design
Microprocessor System Design Lecture 11: Reducing Hit Time Cache Coherence Zeshan Chishti Electrical and Computer Engineering Dept Maseeh College of Engineering and Computer Science Source: Lecture based
More informationHandout 3 Multiprocessor and thread level parallelism
Handout 3 Multiprocessor and thread level parallelism Outline Review MP Motivation SISD v SIMD (SIMT) v MIMD Centralized vs Distributed Memory MESI and Directory Cache Coherency Synchronization and Relaxed
More informationLecture 7: PCM Wrap-Up, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro
Lecture 7: M Wrap-Up, ache coherence Topics: handling M errors and writes, cache coherence intro 1 Optimizations for Writes (Energy, Lifetime) Read a line before writing and only write the modified bits
More informationLecture 24: Virtual Memory, Multiprocessors
Lecture 24: Virtual Memory, Multiprocessors Today s topics: Virtual memory Multiprocessors, cache coherence 1 Virtual Memory Processes deal with virtual memory they have the illusion that a very large
More informationLecture 3: Snooping Protocols. Topics: snooping-based cache coherence implementations
Lecture 3: Snooping Protocols Topics: snooping-based cache coherence implementations 1 Design Issues, Optimizations When does memory get updated? demotion from modified to shared? move from modified in
More informationLecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )
Systems Group Department of Computer Science ETH Zürich Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Today Non-Uniform
More information1. Memory technology & Hierarchy
1. Memory technology & Hierarchy Back to caching... Advances in Computer Architecture Andy D. Pimentel Caches in a multi-processor context Dealing with concurrent updates Multiprocessor architecture In
More informationLecture 23: Thread Level Parallelism -- Introduction, SMP and Snooping Cache Coherence Protocol
Lecture 23: Thread Level Parallelism -- Introduction, SMP and Snooping Cache Coherence Protocol CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu
More informationSpring 2011 Parallel Computer Architecture Lecture 4: Multi-core. Prof. Onur Mutlu Carnegie Mellon University
18-742 Spring 2011 Parallel Computer Architecture Lecture 4: Multi-core Prof. Onur Mutlu Carnegie Mellon University Research Project Project proposal due: Jan 31 Project topics Does everyone have a topic?
More informationMULTIPROCESSORS AND THREAD LEVEL PARALLELISM
UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationLimitations of parallel processing
Your professor du jour: Steve Gribble gribble@cs.washington.edu 323B Sieg Hall all material in this lecture in Henessey and Patterson, Chapter 8 635-640 645, 646 654-665 11/8/00 CSE 471 Multiprocessors
More informationPortland State University ECE 588/688. Cray-1 and Cray T3E
Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector
More informationParallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?
Parallel Computers CPE 63 Session 20: Multiprocessors Department of Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection of processing
More informationECE/CS 757: Homework 1
ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)
More informationEECS 570 Lecture 11. Directory-based Coherence. Winter 2019 Prof. Thomas Wenisch
Directory-based Coherence Winter 2019 Prof. Thomas Wenisch http://www.eecs.umich.edu/courses/eecs570/ Slides developed in part by Profs. Adve, Falsafi, Hill, Lebeck, Martin, Narayanasamy, Nowatzyk, Reinhardt,
More informationLecture 18: Coherence Protocols. Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections
Lecture 18: Coherence Protocols Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections 4.2-4.4) 1 SMP/UMA/Centralized Memory Multiprocessor Main Memory I/O System
More informationMultiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.
Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network
More informationOne-Level Cache Memory Design for Scalable SMT Architectures
One-Level Cache Design for Scalable SMT Architectures Muhamed F. Mudawar and John R. Wani Computer Science Department The American University in Cairo mudawwar@aucegypt.edu rubena@aucegypt.edu Abstract
More informationEE382 Processor Design. Processor Issues for MP
EE382 Processor Design Winter 1998 Chapter 8 Lectures Multiprocessors, Part I EE 382 Processor Design Winter 98/99 Michael Flynn 1 Processor Issues for MP Initialization Interrupts Virtual Memory TLB Coherency
More informationModule 14: "Directory-based Cache Coherence" Lecture 31: "Managing Directory Overhead" Directory-based Cache Coherence: Replacement of S blocks
Directory-based Cache Coherence: Replacement of S blocks Serialization VN deadlock Starvation Overflow schemes Sparse directory Remote access cache COMA Latency tolerance Page migration Queue lock in hardware
More informationECSE 425 Lecture 30: Directory Coherence
ECSE 425 Lecture 30: Directory Coherence H&P Chapter 4 Last Time Snoopy Coherence Symmetric SMP Performance 2 Today Directory- based Coherence 3 A Scalable Approach: Directories One directory entry for
More informationModule 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.
MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line
More informationAn Autonomous Dynamically Adaptive Memory Hierarchy for Chip Multiprocessors
ACM IEEE 37 th International Symposium on Computer Architecture Elastic Cooperative Caching: An Autonomous Dynamically Adaptive Memory Hierarchy for Chip Multiprocessors Enric Herrero¹, José González²,
More informationCOSC 6385 Computer Architecture - Multi Processor Systems
COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:
More informationSuggested Readings! What makes a memory system coherent?! Lecture 27" Cache Coherency! ! Readings! ! Program order!! Sequential writes!! Causality!
1! 2! Suggested Readings! Readings!! H&P: Chapter 5.8! Could also look at material on CD referenced on p. 538 of your text! Lecture 27" Cache Coherency! 3! Processor components! Multicore processors and
More informationToken Coherence. Milo M. K. Martin Dissertation Defense
Token Coherence Milo M. K. Martin Dissertation Defense Wisconsin Multifacet Project http://www.cs.wisc.edu/multifacet/ University of Wisconsin Madison (C) 2003 Milo Martin Overview Technology and software
More informationInvestigating design tradeoffs in S-NUCA based CMP systems
Investigating design tradeoffs in S-NUCA based CMP systems P. Foglia, C.A. Prete, M. Solinas University of Pisa Dept. of Information Engineering via Diotisalvi, 2 56100 Pisa, Italy {foglia, prete, marco.solinas}@iet.unipi.it
More informationPortland State University ECE 588/688. Directory-Based Cache Coherence Protocols
Portland State University ECE 588/688 Directory-Based Cache Coherence Protocols Copyright by Alaa Alameldeen and Haitham Akkary 2018 Why Directory Protocols? Snooping-based protocols may not scale All
More information