Cache Coherence Protocols for Chip Multiprocessors - I


John Mellor-Crummey
Department of Computer Science, Rice University
johnmc@rice.edu
COMP 522, Lecture 5, 6 September 2016

Context

Thus far:
- chip multiprocessors
- hardware threading strategies: simultaneous multithreading, fine-grain multithreading
- future microprocessor issues and trends

Today: sharing cache in chip multiprocessors
- cache coherence
- victim replication

Today's References

- Chapter 6: Coherence Protocols; Chapter 7: Snooping Coherence Protocols; Chapter 8: Directory Coherence Protocols. In: A Primer on Memory Consistency and Cache Coherence. Daniel J. Sorin, Mark D. Hill, David A. Wood. Synthesis Lectures on Computer Architecture, Morgan & Claypool, 2011.
- Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors. M. Zhang and K. Asanovic. In Proceedings of the 32nd International Symposium on Computer Architecture, Madison, WI, June 2005.

A Primer on Caching and Coherence

Cache

A content-addressable memory used to store data items so that future requests can be served faster
- reduces the average latency of accessing storage

How does a datum become stored in a cache?
- value written by an earlier computation
- duplicate of a value available from storage elsewhere

Results of load/store operations
- cache hit: requested data is present in the cache
- cache miss: requested data is not present in the cache
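To make the hit/miss mechanics concrete, here is a minimal Python sketch (not from the slides) of a direct-mapped cache lookup; the block size, set count, and the Cache class are illustrative assumptions.

    # Minimal direct-mapped cache lookup sketch (illustrative only).
    BLOCK_SIZE = 64      # bytes per cache block (assumed)
    NUM_SETS   = 256     # number of sets (assumed)

    class Cache:
        def __init__(self):
            # each set holds (valid, tag); data payload omitted for brevity
            self.sets = [{"valid": False, "tag": None} for _ in range(NUM_SETS)]

        def lookup(self, addr):
            block = addr // BLOCK_SIZE
            index = block % NUM_SETS          # which set the block maps to
            tag   = block // NUM_SETS         # identifies the block within the set
            line  = self.sets[index]
            if line["valid"] and line["tag"] == tag:
                return "hit"                  # requested data present in cache
            # miss: fetch from storage elsewhere, then fill the line
            line["valid"], line["tag"] = True, tag
            return "miss"

    c = Cache()
    print(c.lookup(0x1000), c.lookup(0x1000))   # miss, then hit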

Consistency vs. Coherence

Consistency models (aka memory models)
- define correct shared-memory behavior in terms of loads and stores (memory reads and writes), without reference to caches or coherence
- can stores be seen out of order? if so, under what conditions?
- sequential consistency vs. weak memory models

Coherence
- problems can arise if multiple actors (e.g., multiple cores) have access to multiple copies of a datum (e.g., in multiple caches) and at least one access is a write
- there must appear to be one and only one value per memory location
- access to stale data (incoherence) is prevented using a coherence protocol: a set of rules implemented by the distributed actors within a system

Goal of Coherence Protocols

Maintain coherence by enforcing the following invariants

Single-Writer, Multiple-Reader (SWMR) Invariant
- for any memory location A, at any given time, there exists either only a single core that may write to A (that core can also read it) or some number of cores that may only read A

Data-Value Invariant
- the value of the memory location at the start of an epoch is the same as its value at the end of its last read-write epoch
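As a small illustration (not from the slides), the sketch below checks whether a snapshot of per-core access permissions for one location satisfies SWMR; encoding permissions as "R"/"W" strings is an assumption.

    # Sketch: check the SWMR invariant for one memory location at one instant.
    # perms maps core id -> "R" (read-only) or "W" (read-write); assumed encoding.
    def swmr_ok(perms):
        writers = [c for c, p in perms.items() if p == "W"]
        readers = [c for c, p in perms.items() if p == "R"]
        # Either a single writer and no other readers, or only readers.
        return (len(writers) == 1 and not readers) or not writers

    print(swmr_ok({0: "W"}))            # True: single writer (may also read)
    print(swmr_ok({0: "R", 1: "R"}))    # True: multiple readers
    print(swmr_ok({0: "W", 1: "R"}))    # False: writer and reader coexist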

Implementing Coherence Invariants

Hardware: typical of systems today
- each cache and the LLC/memory has an associated finite state machine known as a coherence controller
- the set of controllers forms a distributed system
- controllers exchange messages to ensure that, for each block, the SWMR and data-value invariants are maintained at all times

Software: relies on compiler and/or runtime support
- may or may not have help from the hardware
- must be conservative to be safe: assume the worst about potential memory aliases
- of increasing interest: concerns about the cost of coherence in joules; scales well for microprocessors based on tiled designs, e.g., the Intel Single-chip Cloud Computer (SCC), 2010

Cache Controller

- accepts loads and stores from the core and returns load values to the core
- on a cache miss, initiates a coherence transaction by issuing a coherence request for the block containing the location accessed by the core
- listens for and responds to coherence requests from other caches
- implements a set of finite state machines, logically one per block, and receives and processes events (e.g., incoming coherence messages) depending upon the block's state

Figure Credit: Daniel J. Sorin, Mark D. Hill, David A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture, Morgan & Claypool, 2011.
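The sketch below (my own, not the primer's) shows this structure: a controller that keeps a logical state per block and dispatches core requests and incoming coherence messages through a protocol table; the state names, message names, and example rule are placeholders to be filled in by a concrete protocol such as the MESI protocol on the next slide.

    # Structural sketch of a cache controller: one logical FSM per block,
    # driven by core requests and by coherence messages from other controllers.
    # State names, message types, and the protocol table are assumptions.
    class CacheController:
        def __init__(self, protocol):
            self.block_state = {}        # block address -> protocol state
            self.protocol = protocol     # table: (state, event) -> (new state, actions)

        def on_core_request(self, block, op):         # op: "load" or "store"
            state = self.block_state.get(block, "I")
            if state == "I":
                # miss: start a coherence transaction for this block
                self.issue_coherence_request(block, op)
            self.apply(block, state, ("core", op))

        def on_coherence_message(self, block, msg):   # msg from another cache/LLC
            state = self.block_state.get(block, "I")
            self.apply(block, state, ("net", msg))

        def apply(self, block, state, event):
            new_state, actions = self.protocol.get((state, event), (state, []))
            self.block_state[block] = new_state
            for act in actions:
                print(f"block {block:#x}: {state} -> {new_state}, action: {act}")

        def issue_coherence_request(self, block, op):
            print(f"block {block:#x}: issue {'GetM' if op == 'store' else 'GetS'}")

    proto = {("I", ("core", "load")): ("S", ["fill from LLC"])}   # one example rule
    cc = CacheController(proto)
    cc.on_core_request(0x80, "load")    # miss in I: issues GetS, then I -> S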

State Diagram for a 4-State Invalidate Protocol

MESI states: Modified, Exclusive, Shared, and Invalid

Permissible state combinations for a pair of caches holding the same block:

          M     E     S     I
    M     no    no    no    yes
    E     no    no    no    yes
    S     no    no    yes   yes
    I     yes   yes   yes   yes

MESI figure credit: http://sc.tamu.edu/images/mesi.png (Copyright Michael Thomadakis, Texas A&M, 2009-2011)
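As an illustration, here is a compact sketch of one common invalidate-based MESI formulation; the event names (BusRd, BusRdX) and the simplifications (no transient states, writebacks only noted in comments) are mine, not the figure's.

    # One common formulation of snooping MESI transitions (labels are mine;
    # real protocols add transient states, writebacks, and upgrade requests).
    def on_processor_read(state, others_have_copy):
        if state == "I":
            return "S" if others_have_copy else "E"   # load miss: issue BusRd
        return state                                   # M/E/S: read hit

    def on_processor_write(state):
        # I issues BusRdX, S issues an upgrade; E and M write silently
        return "M"

    def on_bus_read(state):             # another core issued BusRd for this block
        return "S" if state in ("M", "E") else state   # M also writes data back

    def on_bus_read_exclusive(state):   # another core wants to write (BusRdX)
        return "I"                      # any valid copy is invalidated

    # Example: core 0 reads (no other copies), then core 1 writes the same block.
    s0 = on_processor_read("I", others_have_copy=False)   # core 0: I -> E
    s0 = on_bus_read_exclusive(s0)                         # core 1's BusRdX: E -> I
    print(s0)   # "I"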

Memory Controller

A memory controller is similar to a cache controller
- listens for and responds to coherence requests from caches
- has only a network side
- does not issue coherence requests (on behalf of loads or stores) or receive coherence responses

Figure Credit: Daniel J. Sorin, Mark D. Hill, David A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture, Morgan & Claypool, 2011.

Snooping Coherence

Snoopy cache systems
- broadcast all invalidates and read requests
- all coherence controllers listen and perform the appropriate coherence operations locally

Figure Credit: Daniel J. Sorin, Mark D. Hill, David A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture, Morgan & Claypool, 2011.

Operation of Snoopy Caches

Once a datum is tagged modified or exclusive
- all subsequent operations can be performed locally in the cache; no external traffic is needed

If a data item is read by a number of processors
- it transitions to the shared state in all caches and all subsequent read operations become local

If multiple processors read and update data
- they generate coherence requests on the bus
- the bus is bandwidth limited: this imposes a limit on updates per second
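The sketch below (an assumption-laden simplification, not the primer's code) shows the broadcast mechanism and counts bus transactions, the resource that limits the update rate mentioned above.

    # Sketch: an invalidate-based snoopy bus. Every write to a shared block is
    # broadcast; all other controllers snoop it and invalidate their copies.
    # Class and method names are illustrative assumptions.
    class SnoopyBus:
        def __init__(self):
            self.caches = []
            self.transactions = 0          # the bus is bandwidth limited

        def broadcast_invalidate(self, src, block):
            self.transactions += 1
            for cache in self.caches:
                if cache is not src:
                    cache.lines.pop(block, None)    # snoop: drop stale copy

    class SnoopyCache:
        def __init__(self, bus):
            self.lines = {}                # block -> state ("S" or "M")
            self.bus = bus
            bus.caches.append(self)

        def read(self, block):
            self.lines.setdefault(block, "S")       # read miss -> shared copy

        def write(self, block):
            if self.lines.get(block) != "M":        # need exclusive ownership
                self.bus.broadcast_invalidate(self, block)
            self.lines[block] = "M"                 # later writes are local

    bus = SnoopyBus()
    a, b = SnoopyCache(bus), SnoopyCache(bus)
    a.read(0x40); b.read(0x40)     # both hold the block shared
    a.write(0x40); a.write(0x40)   # first write broadcasts; second is local
    print(bus.transactions, 0x40 in b.lines)   # 1 False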

Directory-based Coherence

Snooping protocol: a cache controller initiates a request for a block by broadcasting a request message to all other coherence controllers

A directory maintains a global view of each block
- tracks which caches hold each block and in what states

Directory protocol: a cache controller initiates a request for a block by sending it to the memory controller that is the home for that block

Figure Credit: Daniel J. Sorin, Mark D. Hill, David A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture, Morgan & Claypool, 2011.
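Here is a rough sketch of a directory entry and its handling of read (GetS) and write (GetM) requests at the home node; the message names follow Sorin et al., but the data structures and the simplified handling are my own illustration (no transient states, data movement only noted in comments).

    # Sketch of a directory entry at a block's home node: who may read or write.
    class DirectoryEntry:
        def __init__(self):
            self.owner = None      # core holding the block modified, if any
            self.sharers = set()   # cores holding read-only copies

        def handle_GetS(self, requester):
            if self.owner is not None:
                # forward to owner / force writeback, then downgrade it to sharer
                self.sharers.add(self.owner)
                self.owner = None
            self.sharers.add(requester)        # data supplied from memory/owner

        def handle_GetM(self, requester):
            others = set(self.sharers)
            if self.owner is not None:
                others.add(self.owner)
            for core in others - {requester}:
                print(f"invalidate copy of this block at core {core}")
            self.sharers = set()
            self.owner = requester             # SWMR: exactly one writer now

    d = DirectoryEntry()
    d.handle_GetS(0); d.handle_GetS(1)     # cores 0 and 1 read
    d.handle_GetM(2)                       # core 2 writes: 0 and 1 invalidated
    print(d.owner, d.sharers)              # 2 set()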

Intel MESIF Protocol (2005)

MESIF: Modified, Exclusive, Shared, Invalid, and Forward

If a cache line is shared
- one shared copy of the cache line is in the F state
- the remaining copies of the cache line are in the S state

Forward (F) state
- designates a single copy of data from which further copies can be made
- a cache line in the F state will respond to a request for a copy of the cache line
- consider how one embodiment of the protocol responds to a read: the newly created copy is placed in the F state, and the cache line previously in the F state is put in the S or the I state

H. Hum et al. US Patent 6,922,756. July 2005. http://bit.ly/gqnkrr
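A minimal sketch of the behavior described above, assuming the simple embodiment on this slide; the function name and state encoding are illustrative, and corner cases (no cached copy, E or M owners) are glossed over.

    # Sketch of MESIF's Forward state on a read (one embodiment, per the slide):
    # exactly one shared copy is in F and supplies the data; the newly created
    # copy takes over F, and the previous F copy drops to S.
    def mesif_read(states, requester):
        """states: dict core -> one of 'M','E','S','I','F' for a single block."""
        forwarder = next((c for c, s in states.items() if s == "F"), None)
        if forwarder is not None:
            states[forwarder] = "S"     # old F copy becomes S (or I)
            states[requester] = "F"     # fresh copy is the new designated responder
        else:
            states[requester] = "S"     # simplified: data came from memory instead
        return states

    print(mesif_read({0: "F", 1: "S", 2: "I"}, requester=2))
    # {0: 'S', 1: 'S', 2: 'F'}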

Dance-Hall Shared Cache CMPs

Example: Niagara-1
- L1 cache co-located with each PE
- PEs on the far side of the interconnect from the L2 cache
- each L2 cache equidistant from all cores

Figure credit: Niagara: A 32-Way Multithreaded SPARC Processor, P. Kongetira, K. Aingaran, and K. Olukotun, IEEE Micro, pp. 21-29, March-April 2005.

Blue Gene/Q Compute Chip (2012)

System on a chip: processor, memory, network logic
- 360 mm², 1.47B transistors
- 16 user cores + 1 service core + 1 spare core; all cores are symmetric
- 4-way SMT per core

Shared L2 cache: 32 MB eDRAM
- multi-versioned cache: transactional memory, speculative execution, atomic operations
- latency ~80 cycles

Dual memory controller
- 16 GB external DDR3 memory
- 1.3 GB/s, 2 x 16-byte-wide interface (+ECC)

Chip-to-chip networking
- integrated router for 5D torus

Figure and information credit: Blue Gene/Q compute chip. Ruud Haring. Hot Chips 23, August 2011. http://bit.ly/qwq1id

Emerging Tiled Architectures

Trends
- more processor cores
- larger cache sizes
- deeper cache hierarchies

Implications
- wire delay of tens of clock cycles across the chip
- worst-case latency: likely unacceptable hit times

Tiled chip multiprocessor approach
- co-locate part of the shared cache near each core
- reduce access latency to (at least some) shared data

Tiled Chip Multiprocessors

Advantages
- simpler replicated physical design; readily scales to larger processor counts
- can support product families with different numbers of tiles

Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005.

Alternatives for Managing Tiled L2 in CMPs

- Treat each slice as a private L2 cache per tile (L2P)
- Manage all slices as a single large shared L2 cache (L2S)

Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005.

Implications of L2 Caching Strategy

Manage each slice as a private L2 cache per tile (L2P)
- use a directory approach to keep caches coherent; tags duplicated and distributed across tiles by set index
- (+) delivers the lowest hit latency; works well when your working set fits in your private L2
- (-) reduces total effective cache capacity: each tile has a local copy of each line it touches and can't borrow L2 space from other PEs with less-full caches

Manage all slices as a single large shared L2 cache (L2S)
- focus: NUCA (non-uniform cache architecture) designs; differs from the dance-hall design in Niagara and Blue Gene/Q
- (+) shared L2 increases effective cache capacity for shared data
- (-) incurs long hit latencies when L2 data is on a remote tile
- migration-based NUCA protocols seem problematic
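To contrast the two organizations, here is a small sketch of where an L2 access is directed under L2P and L2S; the tile count, block size, and home-tile hash are illustrative assumptions about one simple address interleaving.

    # Sketch: where does an L2 access go under L2P vs. L2S?
    NUM_TILES  = 8
    BLOCK_SIZE = 64

    def home_tile(addr):
        # interleave blocks across tiles using low-order block-address bits
        return (addr // BLOCK_SIZE) % NUM_TILES

    def l2_lookup(addr, my_tile, organization):
        if organization == "L2P":
            # private: always the local slice (low latency, lower total capacity)
            return my_tile
        # shared (L2S): the slice on the block's home tile (higher capacity,
        # but possibly a long on-chip round trip)
        return home_tile(addr)

    print(l2_lookup(0x12340, my_tile=3, organization="L2P"))  # 3 (local slice)
    print(l2_lookup(0x12340, my_tile=3, organization="L2S"))  # the block's home tile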

Victim Replication

- Combines advantages of the private and shared L2$ schemes
- A variant of the shared scheme
- Attempts to keep copies of local L1$ victims in the local L2$
- A retained victim is a replica of a line in an L2 on the remote home tile

Victim Replication in Action

Dynamically builds a small victim cache in the L2

Processor misses in the shared L2
- bring the line from memory; place it in the L2 of a home tile determined by a subset of the address bits
- also bring it into the L1 of the requester

Incoming invalidation to a processor
- follow the usual L2S protocol (check local L1 and L2)

If an L1 line is evicted on a conflict or capacity miss
- attempt to copy the victim line into the local L2

Primary cache misses must check for a local replica
- on a miss with no replica: forward the request to the home tile
- on a replica hit: invalidate the replica in the local L2 and move the line to the local L1
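The sketch below traces the two L2VR paths just described: a primary-cache miss that first probes the local slice for a replica, and an L1 eviction that tries to leave a replica behind; the data structures and helper names are illustrative assumptions, not the paper's implementation.

    # Sketch of the two victim-replication paths described above (L2VR).
    def l1_miss(tile, block, home):
        # primary-cache miss: check the local slice for a replica first
        if block in tile["l2_replicas"]:
            tile["l2_replicas"].remove(block)   # replica moves up into L1
            tile["l1"].add(block)
            return "replica hit in local L2"
        tile["l1"].add(block)                   # otherwise use the usual L2S path
        return f"forwarded to home tile {home}"

    def l1_evict(tile, block, is_home_tile, can_make_replica):
        tile["l1"].discard(block)
        # try to keep the victim as a replica of the copy on its remote home tile
        if not is_home_tile and can_make_replica:
            tile["l2_replicas"].add(block)
            return "victim replicated locally"
        return "victim dropped (global copy remains at home tile)"

    tile = {"l1": set(), "l2_replicas": set()}
    print(l1_miss(tile, 0xA0, home=5))                          # forwarded to home
    print(l1_evict(tile, 0xA0, is_home_tile=False, can_make_replica=True))
    print(l1_miss(tile, 0xA0, home=5))                          # replica hit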

Victim Replacement Policy

Never evict a global shared line in favor of a local replica

L2VR replaces lines in the following priority order
1. an invalid line
2. a global line with no sharers
3. an existing replica

If no lines belong to these categories
- no replica is made in the local L2 cache
- the victim is evicted from the tile as in L2S

More than one candidate line? Pick one at random.
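A small sketch of this priority order, assuming each candidate line is described by valid/replica/sharer fields (the field names are mine); it returns None when only global lines with sharers remain, in which case no replica is made.

    # Sketch of the L2VR replacement priority for making room for a replica.
    import random

    def choose_victim(candidate_lines):
        def invalid(l):            return not l["valid"]
        def global_no_sharers(l):  return l["valid"] and not l["replica"] and l["sharers"] == 0
        def existing_replica(l):   return l["valid"] and l["replica"]

        for eligible in (invalid, global_no_sharers, existing_replica):
            matches = [l for l in candidate_lines if eligible(l)]
            if matches:
                return random.choice(matches)   # ties broken at random
        return None   # never evict a global line with sharers; make no replica

    lines = [
        {"valid": True,  "replica": False, "sharers": 2},   # shared global line: protected
        {"valid": True,  "replica": False, "sharers": 0},   # global, no sharers
        {"valid": True,  "replica": True,  "sharers": 0},   # existing replica
    ]
    print(choose_victim(lines))   # picks the global line with no sharers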

Advantages of Victim Replication

- Hits to replicated copies reduce the effective latency of the shared L2
- Higher effective capacity for shared data than a private L2

Victim Replication Evaluation Parameters

- 8-way CMP: 4x2 grid
- Associativity is 2x the number of PEs (problematic for a large tiled CMP?)

Table credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005.

VR Single-threaded Benchmarks

Table credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005.

Single-Threaded Access Latencies (lower is better)

- L2VR adapts to provide a 3-level hierarchy: L1, local L2, remote L2
- L2S latency is higher than its competitors for single-threaded programs
- L2VR latency is close to L2P

Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005.

Single-Threaded Off-chip Miss Rate (lower is better)

- Lower miss rates than L2P
- Slightly higher than L2S

Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005.

Single-Threaded On-chip Coherence Traffic (lower is better)

- 71% fewer coherence message hops using L2VR than L2S
- L2VR comparable to L2P

Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005.

VR Multithreaded Benchmarks

Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005.

Multi-Threaded Average Access Latency (lower is better)

L2 slice = 1 MB
- CG almost fits in the private L2 cache; the low latency of L2P helps its high (9%) L1 miss rate
- IS fits in L1
- BT, FT, LU, SP, and apache fit in an L2 slice
- MG, EP, and checkers do better with L2VR

Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005.

Multi-Threaded Off-chip Miss Rates (lower is better)

- CG almost fits in the L2 cache; the L1 miss latency of L2P dominates the cost of off-chip traffic
- MG and EP improve with L2VR: they have fewer off-chip misses than L2P
- dbench: high miss rates regardless

Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005.

Multi-Threaded On-chip Coherence Traffic (lower is better)

- L2VR has less coherence traffic than L2S

Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005.

MT: Average % of L2$ Held as Replicas Over Time

- L2VR is adaptive: the replica fraction differs across applications and over time

Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005.

MT Memory Access Breakdown (L2P, L2S, L2VR)

- Ideally: a low number of misses, with most hits in the local L2

Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005.

Victim Replication Summary

- Distributed shared L2 caches decrease off-chip traffic vs. private caches, at the expense of latency
- Victim replication reduces on-chip latency by replicating cache lines, within the same level of cache, near threads that are actively accessing them
- Result: a dynamically self-tuning hybrid between private and shared caches

Multithreaded benchmark results summary
- in most cases, L2VR creates enough replicas that performance is usually within 5% of L2P
- L2VR reduces memory latency by an average of 16% compared to L2S
- CG is the only case where L2P significantly outperforms both L2S and L2VR (it almost fits in a private L2)

Additional References

Victim Caching
- Jouppi, N. P. 1990. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. SIGARCH Comput. Archit. News 18, 3a (June 1990), 364-373. DOI: http://doi.acm.org/10.1145/325096.325162

- Shared Caches in Multicores: The Good, The Bad, and The Ugly. Mary Jane Irwin. Athena Award Lecture. International Symposium on Computer Architecture, Saint-Malo, France, June 2010.