Computer Architecture and Parallel Computing (并行结构与计算), Lecture 6: Coherence Protocols

Computer Architecture and Parallel Computing (并行结构与计算), Lecture 6: Coherence Protocols. Peng Liu (刘鹏), College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China. liupeng@zju.edu.cn

Last time, in Lecture 05: superscalar multithreading.

Parallel Processing: Déjà vu all over again?

"...today's processors are nearing an impasse as technologies approach the speed of light..." David Mitchell, The Transputer: The Time Is Now (1989). The Transputer had bad timing: uniprocessor performance kept climbing, so procrastination was rewarded with 2x sequential performance every 1.5 years.

"We are dedicating all of our future product development to multicore designs. This is a sea change in computing." Paul Otellini, President, Intel (2005). All microprocessor companies have switched to multiprocessors (2+ CPUs every 2 years); procrastination is now penalized: sequential performance doubles only every 5 years.

Even handheld systems are moving to multicore: the iPad 4 and iPhone 6 have two or more cores each, and the PlayStation 3 has 9 cores (Cell).

Symmetric Multiprocessors

[Figure: two processors and several I/O controllers (I/O bus, graphics output, networks) sharing a CPU-memory bus with memory through a bridge.]

"Symmetric" means all memory is equally far away from all processors, and any processor can do any I/O (e.g., set up a DMA transfer).

Synchronization

The need for synchronization arises whenever there are concurrent processes in a system, even in a uniprocessor system. Producer-Consumer: a consumer process must wait until the producer process has produced data. Mutual Exclusion: ensure that only one process uses a shared resource at a given time.

A Producer-Consumer Example

[Figure: a queue in memory with tail and head pointers; the producer holds Rtail, the consumer holds Rhead and R.]

Producer posting item x:
        Load Rtail, (tail)
        Store (Rtail), x
        Rtail = Rtail + 1
        Store (tail), Rtail

Consumer:
        Load Rhead, (head)
  spin: Load Rtail, (tail)
        if Rhead == Rtail goto spin
        Load R, (Rhead)
        Rhead = Rhead + 1
        Store (head), Rhead
        process(R)

The program is written assuming instructions are executed in order. Problems?
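For reference, here is a minimal C sketch of the same single-producer, single-consumer queue; the names (queue, head, tail, CAPACITY) are ours, not from the slides. It deliberately uses plain loads and stores, so it is only correct if the hardware and the compiler preserve program order, which is exactly the assumption the slide questions.

    #define CAPACITY 16

    static int queue[CAPACITY];
    static volatile unsigned head = 0, tail = 0;  /* shared between the two threads */

    /* Producer: store the item first, then publish it by bumping tail. */
    void produce(int x) {
        while (tail - head == CAPACITY) { }   /* spin: queue full */
        queue[tail % CAPACITY] = x;
        tail = tail + 1;  /* must not become visible before the store of x */
    }

    /* Consumer: wait until tail moves past head, then consume one item. */
    int consume(void) {
        while (head == tail) { }              /* spin: queue empty */
        int r = queue[head % CAPACITY];
        head = head + 1;
        return r;
    }

On an out-of-order machine (or after compiler reordering) the update of tail can become visible before the data store, so the consumer may read garbage; the rest of the lecture develops the machinery that makes this correct.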

Sequential Consistency: A Memory Model

[Figure: several processors P sharing a single memory M.]

"A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program." Leslie Lamport

Sequential consistency = an arbitrary order-preserving interleaving of the memory references of sequential programs.

Sequential Consistency

Sequential concurrent tasks T1, T2; shared variables X, Y (initially X = 0, Y = 10):

    T1:                          T2:
    Store (X), 1    (X = 1)      Load R1, (Y)
    Store (Y), 11   (Y = 11)     Store (Y'), R1    (Y' = Y)
                                 Load R2, (X)
                                 Store (X'), R2    (X' = X)

What are the legitimate answers for X' and Y'? Is (X', Y') ∈ {(1,11), (0,10), (1,10), (0,11)}? If Y' is 11, then X' cannot be 0.

Sequential Consistency

Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies. What are the additional constraints in our example?

    T1:                          T2:
    Store (X), 1    (X = 1)      Load R1, (Y)
    Store (Y), 11   (Y = 11)     Store (Y'), R1    (Y' = Y)
                                 Load R2, (X)
                                 Store (X'), R2    (X' = X)

The additional SC requirements: T1's two stores (to different addresses) must appear in program order, and T2's load of Y must appear before its load of X, even though no uniprocessor dependence orders them. Does (can) a system with caches or out-of-order execution capability provide a sequentially consistent view of the memory? More on this later.

Multiple Consumer Example

[Figure: one producer with Rtail; two consumers, each with Rhead, Rtail, and R.]

Producer posting item x:
        Load Rtail, (tail)
        Store (Rtail), x
        Rtail = Rtail + 1
        Store (tail), Rtail

Consumer:
        Load Rhead, (head)
  spin: Load Rtail, (tail)
        if Rhead == Rtail goto spin
        Load R, (Rhead)
        Rhead = Rhead + 1
        Store (head), Rhead
        process(R)

The critical section (from loading head through storing head) needs to be executed atomically by one consumer at a time, which calls for locks. What is wrong with this code?

Locks or Semaphores (E. W. Dijkstra, 1965)

A semaphore is a non-negative integer, with the following operations: P(s): if s > 0, decrement s by 1, otherwise wait; V(s): increment s by 1 and wake up one of the waiting processes. P's and V's must be executed atomically, i.e., without interruptions or interleaved accesses to s by other processors.

Process i:
        P(s)
        <critical section>
        V(s)

The initial value of s determines the maximum number of processes in the critical section.
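A minimal spinning sketch of P and V with C11 atomics (the slides do not prescribe an implementation, and a real semaphore would block and wake sleepers instead of spinning; the struct and function names are ours):

    #include <stdatomic.h>

    struct semaphore { atomic_int s; };

    /* P(s): atomically decrement s if it is positive, otherwise keep waiting. */
    void sem_P(struct semaphore *sem) {
        for (;;) {
            int v = atomic_load(&sem->s);
            if (v > 0 && atomic_compare_exchange_weak(&sem->s, &v, v - 1))
                return;
            /* otherwise spin; a blocking implementation would sleep here */
        }
    }

    /* V(s): increment s; a blocking implementation would also wake one waiter. */
    void sem_V(struct semaphore *sem) {
        atomic_fetch_add(&sem->s, 1);
    }

The compare-exchange is what makes the test and the decrement one atomic step; two plain instructions ("if s > 0" then "s = s - 1") could interleave with another processor's P.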

Implementation of Semaphores

Semaphores (mutual exclusion) can be implemented using ordinary Load and Store instructions in the Sequential Consistency memory model. However, protocols for mutual exclusion built that way are difficult to design... A simpler solution: atomic read-modify-write instructions. Examples, where m is a memory location and R is a register:

    Test&Set (m), R:        R ← M[m]; if R == 0 then M[m] ← 1;
    Fetch&Add (m), RV, R:   R ← M[m]; M[m] ← R + RV;
    Swap (m), R:            Rt ← M[m]; M[m] ← R; R ← Rt;
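C11 exposes Test&Set directly as atomic_flag_test_and_set, so a spinlock built on it is a faithful sketch of the primitive (the lock/unlock names are ours):

    #include <stdatomic.h>

    static atomic_flag mutex = ATOMIC_FLAG_INIT;

    void lock(void) {
        /* Test&Set returns the old value: loop until we are the one that
           changed the flag from clear (0) to set (1). */
        while (atomic_flag_test_and_set(&mutex))
            ;  /* spin */
    }

    void unlock(void) {
        atomic_flag_clear(&mutex);  /* release: the V operation */
    }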

Multiple Consumers Example, using the Test&Set Instruction

     P: Test&Set (mutex), Rtemp
        if (Rtemp != 0) goto P
        Load Rhead, (head)            ; critical section begins
  spin: Load Rtail, (tail)
        if Rhead == Rtail goto spin
        Load R, (Rhead)
        Rhead = Rhead + 1
        Store (head), Rhead           ; critical section ends
     V: Store (mutex), 0
        process(R)

Other atomic read-modify-write instructions (Swap, Fetch&Add, etc.) can also implement P's and V's. What if the process stops or is swapped out while in the critical section?

Nonblocking Synchronization

    Compare&Swap (m), Rt, Rs:
        if Rt == M[m]
        then M[m] ← Rs; Rs ← Rt; status ← success;
        else status ← fail;

status is an implicit argument.

   try: Load Rhead, (head)
  spin: Load Rtail, (tail)
        if Rhead == Rtail goto spin
        Load R, (Rhead)
        Rnewhead = Rhead + 1
        Compare&Swap (head), Rhead, Rnewhead
        if (status == fail) goto try
        process(R)
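The same lock-free consume can be sketched with C11's compare-exchange, reusing the hypothetical ring buffer from earlier; only the head-advancing CAS makes the dequeue atomic:

    #include <stdatomic.h>

    #define CAPACITY 16
    extern int queue[CAPACITY];        /* assumed shared ring buffer */
    extern atomic_uint head, tail;

    /* Lock-free consume: read the item, then claim it by advancing head
       with a CAS; if the CAS fails, another consumer won the race, so retry. */
    int consume(void) {
        for (;;) {
            unsigned h = atomic_load(&head);
            if (h == atomic_load(&tail))
                continue;                        /* spin: queue empty */
            int r = queue[h % CAPACITY];
            if (atomic_compare_exchange_strong(&head, &h, h + 1))
                return r;                        /* claimed: the item is ours */
            /* CAS failed: head moved under us; goto try */
        }
    }

Nothing blocks: a consumer descheduled inside the loop cannot prevent the others from making progress, which answers the "swapped out in the critical section" worry from the previous slide.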

Load-reserve & Store-conditional

Special register(s) hold a reservation flag and address, and the outcome of the store-conditional:

    Load-reserve R, (m):
        <flag, adr> ← <1, m>; R ← M[m];

    Store-conditional (m), R:
        if <flag, adr> == <1, m>
        then cancel other procs' reservations on m;
             M[m] ← R; status ← succeed;
        else status ← fail;

   try: Load-reserve Rhead, (head)
  spin: Load Rtail, (tail)
        if Rhead == Rtail goto spin
        Load R, (Rhead)
        Rhead = Rhead + 1
        Store-conditional (head), Rhead
        if (status == fail) goto try
        process(R)
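C11 has no direct load-reserve/store-conditional, but atomic_compare_exchange_weak is allowed to fail spuriously precisely so compilers can map it to an LL/SC pair on ISAs such as RISC-V and ARM. A hedged sketch of the head update in that style:

    #include <stdatomic.h>

    extern atomic_uint head;   /* the shared head index from the example */

    /* Advance head by one. The weak CAS may compile to load-reserve /
       store-conditional, so it can fail spuriously and must sit in a loop. */
    void advance_head(void) {
        unsigned h = atomic_load(&head);
        while (!atomic_compare_exchange_weak(&head, &h, h + 1))
            ;   /* on failure, h has been reloaded with the current value */
    }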

Performance of Locks

Blocking atomic read-modify-write instructions (e.g., Test&Set, Fetch&Add, Swap) vs. non-blocking atomic read-modify-write instructions (e.g., Compare&Swap, Load-reserve/Store-conditional) vs. protocols based on ordinary Loads and Stores. Performance depends on several interacting factors: degree of contention, caches, and out-of-order execution of Loads and Stores. More on this later...

Issues in Implementing Sequential Consistency

[Figure: processors P sharing a memory M.]

Implementation of SC is complicated by two issues. First, out-of-order execution capability: which reorderings are safe under uniprocessor dependencies alone?

    Load(a); Load(b)     yes
    Load(a); Store(b)    yes, if a ≠ b
    Store(a); Load(b)    yes, if a ≠ b
    Store(a); Store(b)   yes, if a ≠ b

Second, caches: caches can prevent the effect of a store from being seen by other processors. No common commercial architecture has a sequentially consistent memory model!

Memory Fences

Memory fences are instructions to sequentialize memory accesses. Processors with relaxed or weak memory models (i.e., ones that permit Loads and Stores to different addresses to be reordered) need to provide memory fence instructions to force the serialization of memory accesses. Examples of processors with relaxed memory models:

    Sparc V8 (TSO, PSO):  Membar
    Sparc V9 (RMO):       Membar #LoadLoad, Membar #LoadStore, Membar #StoreLoad, Membar #StoreStore
    PowerPC (WO):         Sync, EIEIO

Memory fences are expensive operations; however, one pays the cost of serialization only when it is required.

Using Memory Fences

[Figure: the producer-consumer queue with tail and head pointers, as before.]

Producer posting item x:
        Load Rtail, (tail)
        Store (Rtail), x
        Membar_SS                 ; ensures the tail pointer is not updated before x has been stored
        Rtail = Rtail + 1
        Store (tail), Rtail

Consumer:
        Load Rhead, (head)
  spin: Load Rtail, (tail)
        if Rhead == Rtail goto spin
        Membar_LL                 ; ensures R is not loaded before x has been stored
        Load R, (Rhead)
        Rhead = Rhead + 1
        Store (head), Rhead
        process(R)
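In C11 the two fences correspond to atomic_thread_fence: a release fence in the producer plays the Membar_SS role, and an acquire fence in the consumer plays the Membar_LL role. A sketch using the hypothetical ring buffer from earlier (the full-queue check is omitted, as in the slide):

    #include <stdatomic.h>

    #define CAPACITY 16
    extern int queue[CAPACITY];
    extern atomic_uint head, tail;

    void produce(int x) {
        unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
        queue[t % CAPACITY] = x;
        atomic_thread_fence(memory_order_release);   /* Membar_SS: x before tail */
        atomic_store_explicit(&tail, t + 1, memory_order_relaxed);
    }

    int consume(void) {
        unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
        while (h == atomic_load_explicit(&tail, memory_order_relaxed))
            ;                                        /* spin: queue empty */
        atomic_thread_fence(memory_order_acquire);   /* Membar_LL: tail before data */
        int r = queue[h % CAPACITY];
        atomic_store_explicit(&head, h + 1, memory_order_relaxed);
        return r;
    }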

Mutual Exclusion Using Load/Store

A protocol based on two shared variables, c1 and c2. Initially both c1 and c2 are 0 (not busy).

    Process 1                        Process 2
    ...                              ...
    c1 = 1;                          c2 = 1;
    L: if c2 == 1 then goto L        L: if c1 == 1 then goto L
    <critical section>               <critical section>
    c1 = 0;                          c2 = 0;

What is wrong? Deadlock! If both processes set their flags before either tests the other's, both spin at L forever.

Mutual Exclusion: Second Attempt

To avoid deadlock, let a process give up its reservation (i.e., Process 1 sets c1 to 0) while waiting:

    Process 1                                  Process 2
    ...                                        ...
    L: c1 = 1;                                 L: c2 = 1;
       if c2 == 1 then { c1 = 0; goto L }         if c1 == 1 then { c2 = 0; goto L }
    <critical section>                         <critical section>
    c1 = 0;                                    c2 = 0;

Deadlock is not possible, but with low probability a livelock may occur: an unlucky process may never get to enter the critical section. Starvation.

A Protocol for Mutual Exclusion (T. Dekker, 1966)

A protocol based on 3 shared variables: c1, c2, and turn. Initially both c1 and c2 are 0 (not busy).

    Process 1                                    Process 2
    ...                                          ...
    c1 = 1;                                      c2 = 1;
    turn = 1;                                    turn = 2;
    L: if c2 == 1 & turn == 1 then goto L        L: if c1 == 1 & turn == 2 then goto L
    <critical section>                           <critical section>
    c1 = 0;                                      c2 = 0;

turn = i ensures that only process i can wait; variables c1 and c2 ensure mutual exclusion. The solution for n processes was given by Dijkstra and is quite tricky!
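The protocol depends on every access to c1, c2, and turn being sequentially consistent; C11's default (seq_cst) atomics provide that, so a direct transcription for Process 1 looks like this sketch (Process 2 is symmetric, with the roles of c1/c2 swapped and turn == 2; the function names are ours):

    #include <stdatomic.h>

    static atomic_int c1, c2, turn;   /* all initially 0 */

    void process1_enter(void) {
        atomic_store(&c1, 1);
        atomic_store(&turn, 1);
        /* wait while the other process is interested and it is our turn to yield */
        while (atomic_load(&c2) == 1 && atomic_load(&turn) == 1)
            ;   /* spin at L */
    }

    void process1_exit(void) {
        atomic_store(&c1, 0);
    }

With plain non-atomic ints, the compiler or a weakly ordered processor could reorder the stores and loads and break mutual exclusion, which is the point of the memory-model discussion above.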

N-process Mutual Exclusion: Lamport's Bakery Algorithm

Process i, entry code:

    choosing[i] = 1;
    num[i] = max(num[0], ..., num[N-1]) + 1;
    choosing[i] = 0;
    for (j = 0; j < N; j++) {
        while (choosing[j]);
        while (num[j] && ((num[j] < num[i]) ||
                          (num[j] == num[i] && j < i)));
    }

Exit code:

    num[i] = 0;

Initially num[j] = 0 for all j.

Memory Coherence in SMPs

[Figure: CPU-1 and CPU-2 each cache A = 100; memory also holds A = 100; all sit on the CPU-memory bus.]

Suppose CPU-1 updates A to 200. With a write-back cache, memory and cache-2 have stale values; with a write-through cache, cache-2 has a stale value. Do these stale values matter? What is the view of shared memory for programming?

Write-back Caches & SC

T1 on CPU-1 runs ST X, 1; ST Y, 11. T2 on CPU-2 runs LD Y, R1; ST Y', R1; LD X, R2; ST X', R2. With write-back caches (initially X = 0, Y = 10 in memory), one possible history:

    T1 executed:                cache-1: X=1, Y=11   memory: X=0, Y=10              cache-2: empty
    cache-1 writes back Y:      cache-1: X=1, Y=11   memory: X=0, Y=11              cache-2: empty
    T2 executed:                cache-1: X=1, Y=11   memory: X=0, Y=11              cache-2: Y=11, Y'=11, X=0, X'=0
    cache-1 writes back X:      cache-1: X=1, Y=11   memory: X=1, Y=11              cache-2: Y=11, Y'=11, X=0, X'=0
    cache-2 writes back X', Y':                      memory: X=1, Y=11, X'=0, Y'=11

T2 sees Y = 11 but X = 0, an outcome sequential consistency forbids.

Write-through Caches & SC

Same programs, write-through caches. Initially cache-1 holds X=0, Y=10; cache-2 holds X=0; memory holds X=0, Y=10.

    T1 executed:   cache-1: X=1, Y=11   memory: X=1, Y=11                  cache-2: X=0 (stale)
    T2 executed:   cache-1: X=1, Y=11   memory: X=1, Y=11, X'=0, Y'=11     cache-2: Y=11, Y'=11, X=0, X'=0

Write-through caches don't preserve sequential consistency either.

Cache Coherence vs. Memory Consistency

A cache coherence protocol ensures that all writes by one processor are eventually visible to other processors for a single memory address, i.e., updates are not lost. A memory consistency model gives the rules on when a write by one processor can be observed by a read on another, across different addresses; equivalently, it defines what values a load can see. A cache coherence protocol is not enough to ensure sequential consistency; but if a system is sequentially consistent, its caches must be coherent. The combination of the cache coherence protocol and the processor's memory reorder buffer implements a given machine's memory consistency model.

Maintaining Cache Coherence

Hardware support is required such that only one processor at a time has write permission for a location, and no processor can load a stale copy of the location after a write. This is the job of cache coherence protocols.

Warmup: Parallel I/O

[Figure: processor and cache on a memory bus with physical memory; a DMA engine connects a disk to the same bus, driving address (A), data (D), and R/W lines.]

Either the cache or the DMA engine can be the bus master and effect transfers. Page transfers occur while the processor is running. (DMA stands for Direct Memory Access: the I/O device can read and write memory autonomously from the CPU.)

Problems with Parallel I/O

[Figure: cached portions of a page sit in the cache while DMA transfers move the page between memory and disk.]

Memory → Disk: physical memory may be stale if the cache copy is dirty, so the DMA engine can write stale data to disk. Disk → Memory: the cache may hold stale data and not see the memory writes performed by the DMA engine.

Snoopy Cache (Goodman 1983)

Idea: have the cache watch (or snoop upon) DMA transfers, and then do the right thing. Snoopy cache tags are dual-ported: one port drives the memory bus when the cache is bus master, and a snoopy read port is attached to the memory bus. [Figure: processor-side and snoop-side ports both reaching the tags-and-state array and the data lines.]

Shared Memory Multiprocessor

[Figure: processors M1, M2, M3, each behind a snoopy cache, plus a DMA engine and disks, all on the memory bus in front of physical memory.]

Use the snoopy mechanism to keep all processors' view of memory coherent.

Snoopy Cache Coherence Protocols

Write miss: the address is invalidated in all other caches before the write is performed. Read miss: if a dirty copy is found in some cache, a write-back is performed before the memory is read.

Cache State Transition Diagram: the MSI Protocol

Each cache line has state bits (M: Modified, S: Shared, I: Invalid) alongside its address tag. The transitions for a line in processor P1's cache:

    I → M:  P1 write miss (P1 gets the line from memory)
    I → S:  P1 read miss (P1 gets the line from memory)
    S → M:  P1 intent to write
    M → M:  P1 reads or writes (hits)
    S → S:  read by any processor
    M → S:  other processor reads (P1 writes back)
    M → I:  other processor intent to write (P1 writes back)
    S → I:  other processor intent to write
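A compact way to capture the diagram is a transition function over (state, event). This sketch is ours; the event names follow the slide's arc labels:

    typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;

    typedef enum {
        PR_READ,     /* this processor reads */
        PR_WRITE,    /* this processor writes (intent to write on the bus) */
        BUS_READ,    /* another processor reads the line */
        BUS_WRITE    /* another processor signals intent to write */
    } msi_event_t;

    /* Next state for one line in this cache; *writeback is set when the
       line's dirty data must be written back / supplied. */
    msi_state_t msi_next(msi_state_t st, msi_event_t ev, int *writeback) {
        *writeback = 0;
        switch (st) {
        case MSI_I:
            if (ev == PR_READ)  return MSI_S;   /* read miss: fetch the line */
            if (ev == PR_WRITE) return MSI_M;   /* write miss: fetch and own */
            return MSI_I;                       /* bus events leave I alone */
        case MSI_S:
            if (ev == PR_WRITE)  return MSI_M;  /* upgrade: broadcast intent to write */
            if (ev == BUS_WRITE) return MSI_I;  /* another writer invalidates us */
            return MSI_S;                       /* reads keep the line shared */
        case MSI_M:
            if (ev == BUS_READ)  { *writeback = 1; return MSI_S; }
            if (ev == BUS_WRITE) { *writeback = 1; return MSI_I; }
            return MSI_M;                       /* own reads and writes hit */
        }
        return st;
    }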

Two Processor Example (Reading and Writing the Same Cache Line)

Access sequence: P1 reads, P1 writes, P2 reads, P2 writes, P1 reads, P1 writes, P2 writes, P1 writes. Tracing the MSI states:

    P1 reads:   P1: I → S (read miss)
    P1 writes:  P1: S → M (intent to write)
    P2 reads:   P2: I → S (read miss); P1: M → S (writes back)
    P2 writes:  P2: S → M (intent to write); P1: S → I
    P1 reads:   P1: I → S (read miss); P2: M → S (writes back)
    P1 writes:  P1: S → M (intent to write); P2: S → I
    P2 writes:  P2: I → M (write miss); P1: M → I (writes back)
    P1 writes:  P1: I → M (write miss); P2: M → I (writes back)

Observation

If a line is in the M state, then no other cache can have a copy of the line! Memory stays coherent: multiple differing copies cannot exist.

MESI: An Enhanced MSI Protocol

MESI adds an Exclusive state for increased performance on private data. Each cache line has state bits (M: Modified Exclusive; E: Exclusive but unmodified; S: Shared; I: Invalid) alongside its address tag. A read miss that no other cache shares fills the line in E rather than S, so a later write by P1 moves E → M silently, with no bus transaction. The transitions for a line in processor P1's cache:

    I → M:  P1 write miss
    I → E:  P1 read miss, not shared
    I → S:  P1 read miss, shared
    E → M:  P1 write (no bus traffic)
    E → S:  other processor reads
    E → I:  other processor intent to write
    M → M:  P1 write or read (hits)
    M → S:  other processor reads (P1 writes back)
    M → I:  other processor intent to write (P1 writes back)
    S → M:  P1 intent to write
    S → S:  read by any processor
    S → I:  other processor intent to write

Optimized Snoop with Level-2 Caches

[Figure: four CPUs, each with an L1 and an L2; the snooper attaches to the L2.]

Processors often have two-level caches: a small L1 and a large L2, usually both on chip now. Inclusion property: entries in L1 must also be in L2, so an invalidation in L2 forces an invalidation in L1. Snooping on L2 does not affect CPU-L1 bandwidth. What problem could occur?

Intervention

[Figure: CPU-1's cache holds A = 200 (modified); CPU-2's cache misses on A; memory still holds the stale value A = 100.]

When a read miss for A occurs in cache-2, a read request for A is placed on the bus. Cache-1 needs to supply the data and change its state to shared. But the memory may respond to the request as well! Does memory know it has stale data? Cache-1 needs to intervene through the memory controller to supply the correct data to cache-2.

False Sharing

[Cache block layout: state, block address, data0, data1, ..., dataN.]

A cache block contains more than one word, and cache coherence is done at the block level, not the word level. Suppose M1 writes word i and M2 writes word k, and both words have the same block address. What can happen?
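A common way this shows up in software: two threads updating adjacent counters that land in one cache block, so every increment invalidates the other thread's copy. The sketch below is illustrative; the 64-byte block size and the padding idiom are assumptions, not from the slides:

    #include <pthread.h>

    /* Both counters in one cache block: writes to a and b ping-pong the block. */
    struct counters_bad  { long a; long b; };

    /* Pad each counter out to its own (assumed 64-byte) block to avoid it. */
    struct counters_good {
        long a; char pad_a[64 - sizeof(long)];
        long b; char pad_b[64 - sizeof(long)];
    };

    static struct counters_good g;

    static void *bump_a(void *arg) { for (long i = 0; i < 100000000L; i++) g.a++; return arg; }
    static void *bump_b(void *arg) { for (long i = 0; i < 100000000L; i++) g.b++; return arg; }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, 0, bump_a, 0);
        pthread_create(&t2, 0, bump_b, 0);
        pthread_join(t1, 0);
        pthread_join(t2, 0);
        return 0;
    }

With counters_bad the two loops contend for one block even though they share no data; with counters_good they run out of separate blocks.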

Synchronization and Caches: Performance Issues

Processors 1, 2, and 3 all run the same lock code:

        R ← 1
    L:  swap (mutex), R;
        if <R> then goto L;
        <critical section>
        M[mutex] ← 0;

[Figure: each processor's cache, with the mutex = 1 line on the CPU-memory bus.]

Cache-coherence protocols will cause the mutex line to ping-pong between P1's and P2's caches. Ping-ponging can be reduced by first reading the mutex location (non-atomically) and executing a swap only if it is found to be zero.
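That read-before-swap refinement is the test-and-test-and-set idiom: spinning on an ordinary load hits in the local cache with the line in shared state, and only the final atomic exchange takes the line exclusive. A C11 sketch (names ours):

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool mutex;   /* false = free */

    void tatas_lock(void) {
        for (;;) {
            /* test: spin on a plain read, a cache hit while the lock is held */
            while (atomic_load(&mutex))
                ;
            /* the lock looked free: now do the expensive atomic swap */
            if (!atomic_exchange(&mutex, true))
                return;          /* swapped false -> true: lock acquired */
        }
    }

    void tatas_unlock(void) {
        atomic_store(&mutex, false);
    }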

Load-reserve & Store-conditional

Special register(s) hold the reservation flag and address, and the outcome of the store-conditional:

    Load-reserve R, (a):
        <flag, adr> ← <1, a>; R ← M[a];

    Store-conditional (a), R:
        if <flag, adr> == <1, a>
        then cancel other procs' reservations on a;
             M[a] ← R; status ← succeed;
        else status ← fail;

If the snooper sees a store transaction to the address in the reserve register, the reserve bit is cleared. Several processors may hold a reservation on a simultaneously. These instructions are like ordinary loads and stores with respect to bus traffic. The reservation can be implemented using the cache hit/miss machinery, with no additional hardware (problems?).

Out-of-Order Loads/Stores & CC

[Figure: CPU with load/store buffers in front of a cache (line states I/S/E); the snooper handles Wb-req, Inv-req, and Inv-rep on the bus side; the cache issues S-req/E-req, receives S-rep/E-rep, and sends pushouts (Wb-rep) to memory.]

Blocking caches: one request at a time; combined with CC, this yields SC. Non-blocking caches: multiple requests (to different addresses) concurrently; combined with CC, this yields relaxed memory models. CC ensures that all processors observe the same order of loads and stores to any one address.

Performance of Symmetric Shared-Memory Multiprocessors

Cache performance is a combination of: 1. uniprocessor cache miss traffic, and 2. traffic caused by communication, which results in invalidations and subsequent cache misses. This adds a 4th C, the coherence miss (sometimes called a communication miss), joining Compulsory, Capacity, and Conflict.

Coherency Misses

1. True sharing misses arise from the communication of data through the cache coherence mechanism: invalidations due to the first write to a shared block, and reads by another CPU of a modified block held in a different cache. The miss would still occur if the block size were one word.

2. False sharing misses occur when a block is invalidated because some word in the block, other than the one being read, is written into. The invalidation does not cause a new value to be communicated, only an extra cache miss: the block is shared, but no word in the block is actually shared. The miss would not occur if the block size were one word.

Example: True vs. False Sharing vs. Hit?

Assume x1 and x2 are in the same cache block, and P1 and P2 have both read x1 and x2 before.

    Time  P1        P2        Classification
    1     Write x1            True miss (invalidates x1 in P2)
    2               Read x2   False miss (x1 irrelevant to P2)
    3     Write x1            False miss (x1 irrelevant to P2)
    4               Write x2  False miss (x1 irrelevant to P2)
    5     Read x2             True miss (invalidates x2 in P1)

A Cache-Coherent System Must:

Provide a set of states, a state transition diagram, and actions, and manage the coherence protocol: (0) determine when to invoke the coherence protocol; (a) find info about the state of the address in other caches, to decide whether it needs to communicate with other cached copies; (b) locate the other copies; (c) communicate with those copies (invalidate/update). (0) is done the same way on all systems: the state of the line is maintained in the cache, and the protocol is invoked if an access fault occurs on the line. Different approaches are distinguished by how they do (a) to (c).

Bus-based Coherence

All of (a), (b), and (c) are done through broadcast on the bus: the faulting processor sends out a search, and the others respond to the search probe and take the necessary action. This could be done in a scalable network too: broadcast to all processors, and let them respond. Conceptually simple, but broadcast doesn't scale with the number of processors P: on a bus, bus bandwidth doesn't scale; on a scalable network, every fault leads to at least P network transactions. Scalable coherence can keep the same cache states and state transition diagram, with different mechanisms for managing the protocol.

Scalable Approach: Directories

Every memory block has associated directory information that keeps track of copies of cached blocks and their states. On a miss, find the directory entry, look it up, and communicate only with the nodes that have copies, if necessary. In scalable networks, communication with the directory and the copies is through network transactions. There are many alternatives for organizing directory information.

Basic Operation of a Directory

With k processors, each cache block in memory has k presence bits and 1 dirty bit; each cache block in a cache has 1 valid bit and 1 dirty (owner) bit.

Read from main memory by processor i:
    If the dirty bit is OFF: read from main memory; turn p[i] ON.
    If the dirty bit is ON: recall the line from the dirty processor (downgrading its cache state to shared); update memory; turn the dirty bit OFF; turn p[i] ON; supply the recalled data to i.

Write to main memory by processor i:
    If the dirty bit is OFF: send invalidations to all caches that have the block; turn the dirty bit ON; supply the data to i; turn p[i] ON; ...
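A sketch of the read-miss handler in C, transcribing the slide's rules directly; the directory-entry layout and the recall/update/supply helpers are hypothetical stand-ins for real network transactions:

    #include <stdbool.h>

    #define K 8                          /* number of processors (assumed) */

    typedef struct {
        bool presence[K];                /* k presence bits */
        bool dirty;                      /* 1 dirty bit */
        int  owner;                      /* meaningful when dirty: owning cache */
    } dir_entry_t;

    /* Hypothetical transaction helpers. */
    void recall_line(int owner, int block);   /* owner downgrades to shared */
    void update_memory(int block);
    void supply_data(int proc, int block);

    /* Read request from processor i for the given block. */
    void dir_read(dir_entry_t *e, int block, int i) {
        if (!e->dirty) {
            supply_data(i, block);        /* read from main memory */
            e->presence[i] = true;        /* turn p[i] ON */
        } else {
            recall_line(e->owner, block); /* recall from the dirty processor */
            update_memory(block);
            e->dirty = false;             /* turn dirty bit OFF */
            e->presence[i] = true;        /* turn p[i] ON */
            supply_data(i, block);        /* supply the recalled data to i */
        }
    }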

Directory Cache Protocol

[Figure: several CPU-cache nodes and several directory-controller-plus-DRAM-bank nodes connected by an interconnection network.]

Assumptions: reliable network, FIFO message delivery between any given source-destination pair.

Cache States

For each cache line, there are 4 possible states:

C-invalid (= Nothing): the accessed data is not resident in the cache.
C-shared (= Sh): the accessed data is resident in the cache, and possibly also cached at other sites. The data in memory is valid.
C-modified (= Ex): the accessed data is exclusively resident in this cache, and has been modified. Memory does not have the most up-to-date data.
C-transient (= Pending): the accessed data is in a transient state (for example, the site has just issued a protocol request but has not yet received the corresponding protocol reply).

Home Directory States

For each memory block, there are 4 possible states:

R(dir): the memory block is shared by the sites specified in dir (dir is a set of sites). The data in memory is valid in this state. If dir is empty (i.e., dir = ε), the memory block is not cached by any site.
W(id): the memory block is exclusively cached at site id, and has been modified at that site. Memory does not have the most up-to-date data.
TR(dir): the memory block is in a transient state, waiting for the acknowledgements to the invalidation requests that the home site has issued.
TW(id): the memory block is in a transient state, waiting for a block exclusively cached at site id (i.e., in C-modified state) to make the memory block at the home site up-to-date.

Protocol Messages

There are 10 different protocol messages:

    Category                      Messages
    Cache-to-memory requests      ShReq, ExReq
    Memory-to-cache requests      WbReq, InvReq, FlushReq
    Cache-to-memory responses     WbRep(v), InvRep, FlushRep(v)
    Memory-to-cache responses     ShRep(v), ExRep(v)

Coherence Rules

Writes eventually become visible to all processors, and writes to the same location are serialized.

Snoopy Coherence Protocols

The bus provides the serialization point: broadcast, totally ordered. Each cache controller snoops all bus transactions, updates the state of its cache in response to processor and snoop events, and generates bus transactions. A snoopy protocol is an FSM: a state-transition diagram plus actions. Writes are handled by write-invalidate or write-update.

Directory Taxonomy & Scalability

Duplicate tags; full-map; sparse; full bit vectors; coarse-grain bit vectors; limited pointers; in-cache; hierarchical sparse.

Directory-Based Coherence

Route all coherence transactions through a directory that tracks the contents of the private caches; no broadcasts are needed. The directory also serves as the ordering point for conflicting requests, so unordered networks can be used.

Coherence & Consistency

Shared memory systems have multiple private caches for performance reasons but need to provide the illusion of a single shared memory. Intuition: a read should return the most recently written value; but what does "most recent" mean? Formally, coherence specifies what values a read can return, and concerns reads and writes to a single memory location; consistency specifies when writes become visible to reads, and concerns reads and writes to multiple memory locations.

Acknowledgements

These slides contain material developed and copyrighted by: Arvind (MIT), Krste Asanovic (MIT/UCB), Joel Emer (Intel/MIT), James Hoe (CMU), John Kubiatowicz (UCB), and David Patterson (UCB). MIT material derived from course 6.823; UCB material derived from course CS252.