Page 1 Cache Coherence

Page 2 Memory Consistency in SMPs

Two CPUs each cache location A, which holds 100:
  cache-1: A = 100    cache-2: A = 100    memory: A = 100 (on the CPU-Memory bus)

Suppose CPU-1 updates A to 200.
- write-back: memory and cache-2 have stale values
- write-through: cache-2 has a stale value

Do these stale values matter? What is the view of shared memory for programming?
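To make the question concrete, here is a hypothetical producer-consumer sketch in C (the names a and flag are illustrative, not from the slide; compile with -pthread). If cache-2 could keep a stale copy of a after CPU-1's write, the assertion could fail even though thread 1 wrote a before setting flag:

    /* Minimal sketch: why a stale cached value matters. */
    #include <assert.h>
    #include <pthread.h>
    #include <stdatomic.h>

    int a = 100;             /* the shared location "A"      */
    atomic_int flag = 0;     /* signals that A was updated   */

    void *producer(void *arg) {
        (void)arg;
        a = 200;                        /* CPU-1 updates A   */
        atomic_store(&flag, 1);         /* publish           */
        return NULL;
    }

    void *consumer(void *arg) {
        (void)arg;
        while (atomic_load(&flag) == 0) /* wait for update   */
            ;
        assert(a == 200);  /* coherence: must not see stale 100 */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, producer, NULL);
        pthread_create(&t2, NULL, consumer, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }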

Page 3 Write-back Caches & SC

prog T1: ST X, 1 / ST Y, 11
prog T2: LD Y, R1 / ST Y', R1 / LD X, R2 / ST X', R2

(each snapshot lists cache-1, memory, cache-2)

T1 is executed:
  cache-1: X=1, Y=11   memory: X=0, Y=10, X'=-, Y'=-    cache-2: empty

cache-1 writes back Y:
  cache-1: X=1, Y=11   memory: X=0, Y=11, X'=-, Y'=-    cache-2: empty

T2 executed:
  cache-1: X=1, Y=11   memory: X=0, Y=11, X'=-, Y'=-    cache-2: Y=11, Y'=11, X=0, X'=0

cache-1 writes back X:
  cache-1: X=1, Y=11   memory: X=1, Y=11, X'=-, Y'=-    cache-2: Y=11, Y'=11, X=0, X'=0

cache-2 writes back X' & Y':
  cache-1: X=1, Y=11   memory: X=1, Y=11, X'=0, Y'=11   cache-2: empty

T2 saw the new value of Y (11) but the old value of X (0), even though T1 stored X before Y: an outcome no sequentially consistent interleaving allows.

Page 4 Write-through Caches & SC

prog T1: ST X, 1 / ST Y, 11
prog T2: LD Y, R1 / ST Y', R1 / LD X, R2 / ST X', R2

Initially:
  cache-1: X=0, Y=10   memory: X=0, Y=10, X'=-, Y'=-    cache-2: X=0

T1 executed:
  cache-1: X=1, Y=11   memory: X=1, Y=11, X'=-, Y'=-    cache-2: X=0 (stale)

T2 executed:
  cache-1: X=1, Y=11   memory: X=1, Y=11, X'=0, Y'=11   cache-2: Y=11, Y'=11, X=0, X'=0

Write-through caches don't preserve sequential consistency either.

Page 5 Maintaining Sequential Consistency

SC is sufficient for correct producer-consumer and mutual-exclusion code (e.g., Dekker).

Multiple copies of a location in various caches can cause SC to break down. Hardware support is required such that:
- only one processor at a time has write permission for a location
- no processor can load a stale copy of the location after a write

cache coherence protocols
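As an illustration of the mutual-exclusion point, the core of a Dekker-style handshake is sketched below in C (simplified: the turn variable and liveness handling are omitted, and a real program would use sequentially consistent atomics rather than volatile). Under SC, if both threads set their flags, at least one must then observe the other's flag as 1, so at most one enters; a stale cached 0 for the peer's flag would let both enter:

    /* Dekker-style core, simplified sketch: correctness rests on SC. */
    volatile int flag0 = 0, flag1 = 0;

    void enter0(void) {
        flag0 = 1;              /* announce intent  (store) */
        while (flag1 != 0)      /* check the peer   (load)  */
            ;                   /* busy-wait                */
        /* critical section */
    }

    void enter1(void) {
        flag1 = 1;
        while (flag0 != 0)
            ;
        /* critical section */
    }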

Page 6 A System with Multiple Caches

[Figure: a tree of caches; each processor P has an L1, pairs of L1s share an L2, and the L2s reach memory M over an interconnect.]

Modern systems often have hierarchical caches.
Each cache has exactly one parent but can have zero or more children.
Only a parent and its children can communicate directly.
The inclusion property is maintained between a parent and its children: a ∈ Li ⇒ a ∈ Li+1

Page 7 Cache Coherence Protocols for SC

write request: the address is invalidated (updated) in all other caches before (after) the write is performed
read request: if a dirty copy is found in some cache, a write-back is performed before the memory is read

We will focus on invalidation protocols as opposed to update protocols.

Update protocols are also called write broadcast. The latency between writing a word in one processor and reading it in another is usually smaller in a write-update scheme. But since bandwidth is more precious, most multiprocessors use a write-invalidate scheme.

Page 8 Warmup: Parallel I/O

[Figure: a processor with a cache on the memory bus, plus physical memory and a DMA engine connected to a disk; both the cache and the DMA engine drive the address (A), data (D), and R/W lines.]

Either the Cache or the DMA engine can be the Bus Master and effect transfers.
Page transfers occur while the Processor runs.
(DMA stands for Direct Memory Access.)

Page 9 Problems with Parallel I/O

[Figure: cached portions of a page sit in the cache while DMA transfers the page between physical memory and disk.]

Memory → Disk: physical memory may be stale if the cache copy is newer (dirty) and has not been written back.
Disk → Memory: the cache may still hold stale data corresponding to the memory locations the DMA just overwrote.

Page 10 Snoopy Cache [Goodman 1983]

Idea: Have the cache watch (or snoop upon) DMA transfers, and then do the right thing.
Snoopy cache tags are dual-ported:
- one port drives the Memory Bus when the Cache is Bus Master
- a snoopy read port is attached to the Memory Bus

A snoopy cache works in analogy to your snoopy next-door neighbor, who is always watching to see what you're doing, and interfering with your life. In the case of the snoopy cache, the caches are all watching the bus for transactions that affect blocks that are in the cache at the moment. The analogy breaks down here: the snoopy cache only does something if your actions actually affect it, while the snoopy neighbor is always interested in what you're up to.

Page 11 Snoopy Cache Actions

Observed Bus Cycle    Cache State          Cache Action
Read Cycle            Address not cached   No action
(Memory → Disk)       Cached, unmodified   No action
                      Cached, modified     Cache intervenes

Write Cycle           Address not cached   No action
(Disk → Memory)       Cached, unmodified   Cache purges its copy
                      Cached, modified     ???

Page 12 Shared Memory Multiprocessor

[Figure: processors M1, M2, M3, each behind a snoopy cache, share a memory bus with physical memory and a DMA engine connected to disks.]

Use the snoopy mechanism to keep all processors' view of memory coherent.

Page 13 Cache State Transition Diagram

Each cache line carries an address tag and state bits:
M: Modified (exclusive)
E: Exclusive, unmodified
S: Shared
I: Invalid

Transitions for cache M1, read off the diagram:
M → M: M1 write or read
M → S: other processor read (write back line)
M → I: other processor intent-to-write (write back line)
E → M: M1 write
E → E: M1 read
E → S: other processor read
E → I: other processor intent-to-write
S → M: M1 write (M1 issues intent-to-write)
S → S: read by any processor
S → I: other processor intent-to-write
I → S: read miss (another cache holds the line)
I → E: read miss (no other cached copy)
I → M: write miss (M1 issues intent-to-write)
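The diagram can also be read as a next-state function. The C sketch below is one illustrative encoding (the enum and event names are ours, not from the slide), not a full protocol engine:

    /* Sketch of the slide's MESI next-state function for one line,
     * from processor M1's point of view. */
    typedef enum { I, S, E, M } mesi_t;

    typedef enum {
        OWN_READ_HIT, OWN_WRITE,          /* M1's own accesses       */
        OWN_READ_MISS_SHARED,             /* fill; others hold copy  */
        OWN_READ_MISS_EXCL,               /* fill; no other copy     */
        OTHER_READ, OTHER_INTENT_TO_WRITE /* snooped bus requests    */
    } event_t;

    mesi_t next_state(mesi_t s, event_t e) {
        switch (e) {
        case OWN_READ_HIT:          return s;   /* no change         */
        case OWN_WRITE:             return M;   /* from S, the bus
                                                   intent-to-write goes
                                                   out first          */
        case OWN_READ_MISS_SHARED:  return S;
        case OWN_READ_MISS_EXCL:    return E;
        case OTHER_READ:            /* M writes the line back first   */
                                    return (s == M || s == E) ? S : s;
        case OTHER_INTENT_TO_WRITE: return I;   /* M writes back first */
        }
        return s;
    }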

Page 14 Two-Processor Example

Block b in cache M1: the same state machine, where the "other processor" events are M2's reads and intent-to-writes (an M2 read forces M → S with a write-back; an M2 intent-to-write forces the line to I).
Block b in cache M2: the symmetric machine, with the roles of M1 and M2 exchanged.

Page 15 Observation

(The same state diagram as Page 13.)

If a line is in state M or E, then no other cache has a valid copy of the line!
Memory therefore stays coherent: multiple differing copies cannot exist.

Page 16 Cache Coherence State Encoding

[Figure: the block address splits into tag, index, and offset; each tag-array entry holds an address tag plus V and D bits, compared against the address tag to produce Hit? and select the word.]

Valid and dirty bits can be used to encode the S, I, and merged (E, M) states:
V=0, D=x: Invalid
V=1, D=0: Shared (not dirty)
V=1, D=1: Exclusive (dirty)

What does it mean to merge the E and M states?
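A minimal sketch of the decoding, assuming just the two tag bits above (names are illustrative):

    /* Decode the slide's (V, D) tag bits into coherence states. */
    typedef enum { INVALID, SHARED, EXCLUSIVE_DIRTY } cstate_t;

    cstate_t decode(int v, int d) {
        if (!v) return INVALID;        /* V=0, D=x                 */
        return d ? EXCLUSIVE_DIRTY     /* V=1, D=1: E and M merged */
                 : SHARED;             /* V=1, D=0: not dirty      */
    }

One reading of the merge question: a line filled exclusively is marked dirty at once, so the first write needs no bus transaction, but the line must be written back on replacement even if it was never written.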

Page 17 2-Level Caches

[Figure: four CPUs, each with an L1 and an L2; a snooper sits at each L2 on the bus.]

Processors often have two-level caches: a small L1 on chip, a large L2 off chip.
Inclusion property: entries in L1 must be in L2, so invalidation in L2 ⇒ invalidation in L1.
Snooping on L2 does not affect CPU-L1 bandwidth.
What problem could occur?

Interlocks are required when both CPU-L1 and L2-Bus interactions involve the same address.

Page 18 Intervention

[Figure: CPU-1's cache holds A=200 (modified); memory still holds the stale value A=100; CPU-2's cache does not hold A.]

When a read miss for A occurs in cache-2, a read request for A is placed on the bus.
Cache-1 needs to supply the data and change its state to Shared.
The memory may respond to the request as well! (Does memory know it has stale data?)
Cache-1 needs to intervene through the memory controller to supply the correct data to cache-2.

Page 19 False Sharing

    state | blk addr | data0 | data1 | ... | dataN

A cache block contains more than one word, and cache coherence is done at the block level, not the word level.
Suppose M1 writes word i and M2 writes word k, and both words have the same block address. What can happen?

The block may be invalidated many times unnecessarily because the two addresses share a common block.
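A user-level C sketch of the effect (assuming 64-byte blocks; compile with -pthread and compare timings with struct hot versus struct cold):

    /* False sharing: 'hot' puts both counters in one cache block, so
     * two writers ping-pong the block even though they never touch
     * each other's word; 'cold' pads each counter to its own block. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 100000000L

    struct hot  { long a, b; };                            /* one block  */
    struct cold { _Alignas(64) long a; _Alignas(64) long b; };

    static struct cold c;  /* switch to 'struct hot' to see the slowdown */

    static void *bump_a(void *arg) {
        (void)arg;
        for (long i = 0; i < N; i++) c.a++;
        return NULL;
    }
    static void *bump_b(void *arg) {
        (void)arg;
        for (long i = 0; i < N; i++) c.b++;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", c.a, c.b);
        return 0;
    }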

Page 20 Synchronization and Caches: Performance Issues

Processors 1, 2, and 3 all spin on a test-and-set lock (mutex = 1 in memory while the lock is held):

    R ← 1
    L: swap(mutex, R);
    if <R> then goto L;
    <critical section>
    M[mutex] ← 0;

Cache-coherence protocols will cause mutex to ping-pong between P1's and P2's caches.
Ping-ponging can be reduced by first reading the mutex location (non-atomically) and executing a swap only if it is found to be zero.
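The suggested fix is the classic test-and-test-and-set idiom; here is an illustrative C11 sketch of it (not the slide's exact pseudocode):

    /* Test-and-test-and-set: spin on an ordinary read, which hits in
     * the local cache and generates no bus traffic, and attempt the
     * atomic swap only when the mutex looks free. */
    #include <stdatomic.h>

    atomic_int mutex = 0;      /* 0 = free, 1 = held */

    void acquire(void) {
        for (;;) {
            while (atomic_load(&mutex) != 0)     /* spin locally  */
                ;
            if (atomic_exchange(&mutex, 1) == 0) /* swap only now */
                return;                          /* got the lock  */
        }
    }

    void release(void) {
        atomic_store(&mutex, 0);
    }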

Page 21 Performance Related to Bus Occupancy

In general, a read-modify-write instruction requires two memory (bus) operations without intervening memory operations by other processors.
In a multiprocessor setting, the bus needs to be locked for the entire duration of the atomic read and write operation:
- expensive for simple buses
- very expensive for split-transaction buses
Modern processors use load-reserve / store-conditional instead.

A split-transaction bus has a read-request transaction followed by a memory-reply transaction that contains the data. Split transactions make the bus available for other masters while the memory reads the words of the requested address. It also normally means that the CPU must arbitrate for the bus to request the data, and the memory must arbitrate for the bus to return the data. Each transaction must be tagged. Split-transaction buses have higher bandwidth and higher latency.

Page 22 Load-reserve & Store-conditional

Special register(s) hold a reservation flag and address, and the outcome of store-conditional:

    Load-reserve(R, a):
        <flag, adr> ← <1, a>;
        R ← M[a];

    Store-conditional(a, R):
        if <flag, adr> == <1, a> then
            cancel other procs' reservation on a;
            M[a] ← <R>;
            status ← succeed;
        else
            status ← fail;

If the snooper sees a store transaction to the address in the reserve register, the reserve bit is set to 0.
Several processors may reserve a simultaneously.
These instructions are like ordinary loads and stores with respect to the bus traffic.
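In use, the pair forms a retry loop: reserve, compute, and retry if the store-conditional fails. The C11 sketch below uses a weak compare-and-swap as a stand-in with the same retry structure; on LL/SC machines such as RISC-V, compilers lower such loops to lr.w/sc.w pairs:

    /* Atomic fetch-and-increment built as a reserve/retry loop. */
    #include <stdatomic.h>

    int fetch_and_increment(atomic_int *a) {
        int old = atomic_load(a);          /* like load-reserve      */
        /* retry if another processor wrote *a in between
           (the store-conditional "fails"); on failure the CAS
           reloads the current value into old                        */
        while (!atomic_compare_exchange_weak(a, &old, old + 1))
            ;
        return old;
    }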

Page 23 Performance: Load-reserve & Store-conditional

The total number of memory (bus) transactions is not necessarily reduced, but splitting an atomic instruction into load-reserve & store-conditional:
- increases bus utilization (and reduces processor stall time), especially on split-transaction buses
- reduces the cache ping-pong effect, because processors trying to acquire a semaphore do not have to perform a store each time

Page 24 Out-of-Order Loads/Stores & CC

[Figure: a CPU's load/store buffers feed a cache (states I/S/E); the snooper handles Wb-req, Inv-req, and Inv-rep; pushout (Wb-rep) goes to memory; S-req/E-req go out on the bus and S-rep/E-rep come back.]

Blocking caches: one request at a time; + CC ⇒ SC.
Non-blocking caches: multiple requests (to different addresses) concurrently; + CC ⇒ relaxed memory models.
CC ensures that all processors observe the same order of loads and stores to an address.