Cache Coherence Protocols for Chip Multiprocessors - I John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 5 6 September 2016
Context. Thus far: chip multiprocessors; hardware threading strategies (simultaneous multithreading, fine-grain multithreading); future microprocessor issues and trends. Today: sharing cache in chip multiprocessors (cache coherence, victim replication) 3
Today's References Chapter 6: Coherence Protocols; Chapter 7: Snooping Coherence Protocols; Chapter 8: Directory Coherence Protocols. A Primer on Memory Consistency and Cache Coherence. Daniel J. Sorin, Mark D. Hill, David A. Wood. Synthesis Lectures on Computer Architecture. Morgan & Claypool. 2011. Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. In Proceedings of the 32nd International Symposium on Computer Architecture, Madison, WI, June 2005. 4
A Primer on Caching and Coherence 5
Cache: a content-addressable memory used to store data items so that future requests will be served faster; reduces the average latency of access to storage. How does a datum become stored in a cache? It is a value written by an earlier computation, or a duplicate of a value available from storage elsewhere. Results of load/store operations: cache hit (requested data is present in cache); cache miss (requested data is not present in cache) 6
Consistency vs. Coherence Consistency models (aka memory models) define correct shared-memory behavior in terms of loads and stores (memory reads and writes), without reference to caches or coherence: can stores be seen out of order? If so, under what conditions? Sequential consistency vs. weak memory models. Coherence: problems can arise if multiple actors (e.g., multiple cores) have access to multiple copies of a datum (e.g., in multiple caches) and at least one access is a write; there must appear to be one and only one value per memory location; access to stale data (incoherence) is prevented using a coherence protocol, a set of rules implemented by the distributed actors within a system 7
Goal of Coherence Protocols Maintain coherence by enforcing the following invariants. Single-Writer, Multiple-Reader (SWMR) Invariant: for any memory location A, at any given time, there exists either only a single core that may write to A (that core may also read it) or some number of cores that may only read A. Data-Value Invariant: the value of a memory location at the start of an epoch is the same as its value at the end of its last read-write epoch 8
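The SWMR invariant can be made concrete with a small sketch. This is illustrative only: the epoch encoding and the `swmr_holds` function are invented for this example, not part of the Primer's formalism.

```python
# Hypothetical sketch: check the SWMR invariant over a trace of epochs
# for a single memory location A. Epoch encoding is invented here:
# ("rw", {core}) for a read-write epoch, ("ro", {cores}) for read-only.
def swmr_holds(epochs):
    for kind, cores in epochs:
        if kind == "rw" and len(cores) != 1:
            return False  # more than one core held write permission
    return True

# A legal interleaving: many readers, then one writer, then readers again.
trace = [("ro", {0, 1, 2}), ("rw", {1}), ("ro", {1, 3})]
print(swmr_holds(trace))
# An illegal epoch: two cores with write permission at once.
print(swmr_holds([("rw", {0, 2})]))
```

Dividing a location's lifetime into read-write and read-only epochs like this is exactly how the text phrases the invariant: each epoch has either one writer or any number of readers.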
Implementing Coherence Invariants Hardware: typical of systems today; each cache and the LLC/memory has an associated finite state machine known as a coherence controller; the set of controllers forms a distributed system; controllers exchange messages to ensure that, for each block, the SWMR and data-value invariants are maintained at all times. Software: relies on compiler and/or runtime support; may or may not have help from the hardware; must be conservative to be safe (assume the worst about potential memory aliases); of increasing interest given concerns about the cost of coherence in joules; scales well for microprocessors based on tiled designs, e.g., Intel Single-chip Cloud Computer (SCC), 2010 9
Cache Controller Cache controller accepts loads and stores from the core and returns load values to the core. On a cache miss, a controller initiates a coherence transaction by issuing a coherence request for the block containing the location accessed by the core. Cache controller listens for and responds to coherence requests from other caches. Implements a set of finite state machines, logically one per block, and receives and processes events (e.g., incoming coherence messages) depending upon the block's state. Figure Credit: Daniel J. Sorin, Mark D. Hill, David A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture. Morgan & Claypool. 2011. 10
State Diagram for 4-state Invalidate Protocol MESI states: Modified, Exclusive, Shared, and Invalid. Permissible state pairs for a pair of caches: M may coexist only with I; E only with I; S with S or I; I with any state. MESI figure credit: http://sc.tamu.edu/images/mesi.png (Copyright Michael Thomadakis, Texas A&M 2009-2011) 11
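The MESI transitions can be illustrated with a toy model. This is a minimal sketch, not any real controller: the `Cache` class and the `read`/`write` helpers are invented, and writebacks are elided.

```python
# Toy MESI sketch for one block shared by several caches.
M, E, S, I = "M", "E", "S", "I"

class Cache:
    def __init__(self):
        self.state = I  # block starts uncached

def read(me, others):
    if me.state != I:
        return  # already readable locally, no traffic needed
    if any(c.state != I for c in others):
        for c in others:
            if c.state in (M, E):
                c.state = S  # remote copy (written back if M) downgrades
        me.state = S
    else:
        me.state = E  # no other cached copy: load Exclusive

def write(me, others):
    for c in others:
        c.state = I  # invalidate all remote copies
    me.state = M

a, b = Cache(), Cache()
read(a, [b])   # a: E (sole copy)
read(b, [a])   # a: S, b: S
write(a, [b])  # a: M, b: I
print(a.state, b.state)
```

Note how every reachable pair of states respects the compatibility rules above: M and E never coexist with anything but I.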
Memory Controller A memory controller is similar to a cache controller: it listens for and responds to coherence requests from caches. Differences: it has only a network side; it does not issue coherence requests (on behalf of loads or stores); it receives coherence responses. Figure Credit: Daniel J. Sorin, Mark D. Hill, David A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture. Morgan & Claypool. 2011. 12
Snooping Coherence Snoopy cache systems broadcast all invalidates and read requests all coherence controllers listen and perform appropriate coherence operations locally Figure Credit: Daniel J. Sorin, Mark D. Hill, David A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture. Morgan Claypool. 2011. 13
Operation of Snoopy Caches Once a datum is tagged modified or exclusive, all subsequent operations can be performed locally in cache; no external traffic is needed. If a data item is read by a number of processors, it transitions to the shared state in all caches and all subsequent read operations become local. If multiple processors read and update data, they generate coherence requests on the bus; the bus is bandwidth limited and imposes a limit on updates per second 14
Directory-based Coherence Snooping protocol: a cache controller initiates a request for a block by broadcasting a request message to all other coherence controllers A directory maintains a global view of each block tracks which caches hold each block and in what states Directory protocol: a cache controller initiates a request for a block by sending it to the memory controller that is the home for that block Figure Credit: Daniel J. Sorin, Mark D. Hill, David A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture. Morgan Claypool. 2011. 15
Intel MESIF Protocol (2005) MESIF: Modified, Exclusive, Shared, Invalid and Forward If a cache line is shared one shared copy of the cache line is in the F state remaining copies of the cache line are in the S state Forward (F) state designates a single copy of data from which further copies can be made cache line in the F state will respond to a request for a copy of the cache line consider how one embodiment of the protocol responds to a read newly created copy is placed in the F state cache line previously in the F state is put in the S or the I state H. Hum et al. US Patent 6,922,756. July 2005. http://bit.ly/gqnkrr 16
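The read behavior described on this slide can be sketched as follows. This is a hedged illustration of the embodiment described above, not Intel's implementation: the `mesif_read` function and the dict encoding are invented, and the demoted F copy goes to S (the slide notes I is also allowed).

```python
# Sketch of a MESIF read: the single F copy supplies the data, the
# newly created copy becomes the new forwarder, and the old F copy
# is demoted to S (one of the two options the protocol permits).
def mesif_read(requester, states):
    """states: dict mapping cache id -> one of "M","E","S","I","F"."""
    forwarder = next((c for c, s in states.items() if s == "F"), None)
    if forwarder is not None:
        states[forwarder] = "S"   # previous F copy drops to Shared
        states[requester] = "F"   # newest copy becomes the forwarder
    elif all(s == "I" for s in states.values()):
        states[requester] = "E"   # sole cached copy, filled from memory
    # (responses from an M or E holder are omitted for brevity)

caches = {"c0": "F", "c1": "S", "c2": "I"}
mesif_read("c2", caches)
print(caches)
```

Keeping exactly one F copy means a request for shared data gets exactly one cache-to-cache response, instead of every S holder replying.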
Dance-Hall Shared Cache CMPs Niagara1 L1 cache co-located with PE PEs on far side of interconnect from L2 cache each L2 cache equidistant from all cores Figure credit: Niagara: A 32-Way Multithreaded SPARC Processor, P. Kongetira, K. Aingaran, and K. Olukotun, IEEE Micro, pp. 21-29, March-April 2005. 17
Blue Gene/Q's BQC Chip (2012) System on a chip: processor, memory, network logic; 360 mm², 1.47B transistors. 16 user + 1 service cores + 1 spare core; all cores are symmetric; 4-way SMT per core. Shared L2 cache: 32MB eDRAM; multi-versioned cache: transactional memory, speculative execution, atomic operations; latency ~80 cycles. Dual memory controller: 16GB external DDR3 memory; 1.3GB/s 2 x 16 byte wide interface (+ECC). Chip-to-chip networking: integrated router for 5D torus. Figure and information credit: Blue Gene/Q compute chip. Ruud Haring. Hot Chips 23. August 2011. http://bit.ly/qwq1id 18
Emerging Tiled Architectures Trends: more processor cores; larger cache sizes; deeper cache hierarchies. Implications: wire delay of tens of clock cycles across chip; worst-case latency: likely unacceptable hit times. Tiled chip multiprocessors approach: co-locate part of the shared cache near each core; reduce access latency to (at least some) shared data 19
Tiled Chip Multiprocessors Advantages: simpler replicated physical design; readily scales to larger processor counts; can support product families with different numbers of tiles. Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005. 20
Alternatives for Managing Tiled L2 in CMPs Treat each slice as a private L2 cache per tile (L2P) Manage all slices as a single large shared L2 cache (L2S) Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005. 21
Implications of L2 Caching Strategy Manage each slice as a private L2 cache per tile: use a directory approach to keep caches coherent; tags duplicated and distributed across tiles by set index. + delivers lowest hit latency; works well when the working set fits in the private L2. Reduces total effective cache capacity: each tile has a local copy of each line it touches; can't borrow L2 space from other PEs with less full caches. Manage all slices as a single large shared L2 cache: focus of NUCA (non-uniform cache architecture) designs; differs from the dance-hall designs in Niagara and Blue Gene/Q. + shared L2 increases effective cache capacity for shared data. Incurs long hit latencies when L2 data is on a remote tile; migration-based NUCA protocols seem problematic 22
Victim Replication Combines advantages of the private and shared L2$ schemes. A variant of the shared scheme: attempts to keep copies of local L1$ victims in the local L2$; the retained victim is a replica of the line in the L2 on its remote home tile 23
Victim Replication in Action Dynamically build a small victim cache in the L2. Processor misses in shared L2: bring line from memory; place in the L2 of a home tile determined by a subset of address bits; also bring into the L1 of the requester. Incoming invalidation to a processor: follow the usual L2S protocol (check local L1 and L2). If an L1 line is evicted on a conflict or capacity miss: attempt to copy the victim line into the local L2. Primary cache misses must check for a local replica: on a miss with no replica, forward the request to the home tile; on a replica hit, invalidate the replica in the local L2 and move the line to the local L1 24
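The L1 miss path above can be sketched in a few lines. This is a toy model under simplifying assumptions (sets stand in for caches, the home-tile lookup is a callback, and the replacement policy is ignored); the `Tile`, `l1_evict`, and `l1_miss` names are invented, not from the paper.

```python
# Illustrative sketch of victim replication's L1 miss/evict handling.
class Tile:
    def __init__(self):
        self.l1 = set()           # lines in the primary cache
        self.l2_replicas = set()  # victim replicas held in the local L2

def l1_evict(tile, addr):
    tile.l1.discard(addr)
    tile.l2_replicas.add(addr)    # try to retain the victim locally

def l1_miss(tile, addr, fetch_from_home):
    if addr in tile.l2_replicas:
        tile.l2_replicas.discard(addr)  # invalidate the local replica...
        tile.l1.add(addr)               # ...and move the line into L1
        return "replica hit"
    fetch_from_home(addr)               # usual L2S protocol: go to home tile
    tile.l1.add(addr)
    return "forwarded to home"

t = Tile()
print(l1_miss(t, 0x40, lambda a: None))  # no replica yet: remote request
l1_evict(t, 0x40)
print(l1_miss(t, 0x40, lambda a: None))  # victim retained: local hit
```

The second miss is serviced from the local L2 slice, which is the latency win the scheme is after.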
Victim Replacement Policy Never evict a global shared line in favor of a local replica. L2VR replaces lines in the following priority order: an invalid line; a global line with no sharers; an existing replica. If no lines belong to these categories, no replica is made in the local L2 cache and the victim is evicted from the tile as in L2S. More than one candidate line? Pick at random 25
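The priority order above amounts to a short selection loop. A minimal sketch, assuming an invented dict encoding of cache lines (the paper's hardware, of course, does this with state bits, not dicts):

```python
# Sketch of L2VR victim selection: invalid line first, then a global
# line with no sharers, then an existing replica; otherwise give up.
import random

def pick_victim(set_lines):
    """set_lines: candidate lines in the target L2 set, each a dict
    with 'state' in {"invalid", "global", "replica"} and 'sharers'."""
    priorities = (
        lambda l: l["state"] == "invalid",
        lambda l: l["state"] == "global" and l["sharers"] == 0,
        lambda l: l["state"] == "replica",
    )
    for wanted in priorities:
        candidates = [l for l in set_lines if wanted(l)]
        if candidates:
            return random.choice(candidates)  # ties broken at random
    return None  # never evict a shared global line for a replica

demo = [{"state": "global", "sharers": 2}, {"state": "replica", "sharers": 0}]
print(pick_victim(demo)["state"])  # the shared global line is protected
```

Returning `None` is the "no replica is made" case: the L1 victim is simply dropped from the tile, exactly as under L2S.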
Advantages of Victim Replication Hits to replicated copies reduce effective latency of shared L2 Higher effective capacity for shared data than private L2 26
Victim Replication Evaluation Parameters 8-way CMP: 4x2 grid Associativity is 2x #PE Problematic for large tiled CMP? Table credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005. 27
VR Single-threaded Benchmarks Table credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005. 28
Single Threaded Access Latencies L2VR adapts to provide a 3-level hierarchy: L1, local L2, remote L2. L2S latency is higher than competitors for single-threaded programs. L2VR latency is close to L2P. Lower is better! Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005. 29
Single Threaded Off-chip Miss Rate Lower miss rates than L2P Slightly higher than L2S Lower is better! Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005. 30
Single Threaded On-chip Coherence Traffic 71% fewer coherence message hops using L2VR than L2S; L2VR comparable to L2P. Lower is better! Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005. 31
VR Multithreaded Benchmarks Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005. 32
Multi-Threaded Average Access Latency L2 slice = 1MB. CG almost fits in the private L2 cache; the low latency of L2P helps with CG's high (9%) L1 miss rate. IS fits in L1. BT, FT, LU, SP, apache fit in an L2 slice. MG, EP, checkers do better with L2VR. Lower is better! Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005. 33
Multi-Threaded Off-chip Miss Rates CG almost fits in the L2 cache; the L1 miss latency of L2P dominates the cost of off-chip traffic. MG and EP improve with L2VR: they have fewer off-chip misses than with L2P. dbench: high miss rates regardless. Lower is better! Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005. 34
Multi-Threaded On-chip Coherence Traffic L2VR has less coherence traffic than L2S Lower is better! Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005. 35
MT: Avg % L2$ as Replica Over Time L2VR is adaptive: differs across applications; differs over time Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005. 36
MT Memory Access Breakdown (L2P, L2S, L2VR) Ideally: low # of misses, most hits in local L2 Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005. 37
Victim Replication Summary Distributed shared L2 caches decrease off-chip traffic vs. private caches, at the expense of latency. Victim replication reduces on-chip latency by replicating cache lines, within the same level of cache, near threads that are actively accessing the line. Result: a dynamically self-tuning hybrid between private and shared caches. Multithreaded benchmark results summary: in most cases, L2VR creates enough replicas that performance is usually within 5% of L2P; L2VR reduces memory latency by an average of 16% compared to L2S; CG is the only case where L2P significantly outperforms both L2S and L2VR (it almost fits in the private L2) 38
Additional References Victim Caching: Jouppi, N. P. 1990. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. SIGARCH Comput. Archit. News 18, 3a (Jun. 1990), 364-373. DOI: http://doi.acm.org/10.1145/325096.325162. Shared Caches in Multicores: The Good, The Bad, and The Ugly. Mary Jane Irwin. Athena Award Lecture. International Symposium on Computer Architecture, Saint-Malo, France, June 2010. 39