Computer Architecture and Parallel Computing (并行结构与计算), Lecture 6: Coherence Protocols

Computer Architecture and Parallel Computing (并行结构与计算), Lecture 6: Coherence Protocols. Peng Liu (刘鹏), College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China. liupeng@zju.edu.cn

Last time, in Lecture 05: superscalar multithreading.

Parallel Processing: Déjà vu all over again?

"...today's processors are nearing an impasse as technologies approach the speed of light..." David Mitchell, The Transputer: The Time Is Now (1989). The Transputer had bad timing: uniprocessor performance kept climbing, so procrastination was rewarded with 2x sequential performance every 1.5 years.

"We are dedicating all of our future product development to multicore designs. This is a sea change in computing." Paul Otellini, President, Intel (2005). All microprocessor companies have switched to multiprocessors (2+ CPUs every 2 years); procrastination is now penalized: sequential performance doubles only every 5 years.

Even handheld systems are moving to multicore: the iPad 4 and iPhone 6 have two or more cores each, and the PlayStation 3 has 9 cores (Cell).

Symmetric Multiprocessors

[Figure: two processors and several I/O controllers (I/O bus, graphics output, networks) sharing a CPU-memory bus with memory through a bridge.]

"Symmetric" means all memory is equally far away from all processors, and any processor can do any I/O (e.g., set up a DMA transfer).

Synchronization

The need for synchronization arises whenever there are concurrent processes in a system, even in a uniprocessor system. Producer-Consumer: a consumer process must wait until the producer process has produced data. Mutual Exclusion: ensure that only one process uses a shared resource at a given time.

A Producer-Consumer Example

[Figure: a queue in memory with tail and head pointers; the producer holds Rtail, the consumer holds Rhead and R.]

Producer posting item x:
        Load Rtail, (tail)
        Store (Rtail), x
        Rtail = Rtail + 1
        Store (tail), Rtail

Consumer:
        Load Rhead, (head)
  spin: Load Rtail, (tail)
        if Rhead == Rtail goto spin
        Load R, (Rhead)
        Rhead = Rhead + 1
        Store (head), Rhead
        process(R)

The program is written assuming instructions are executed in order. Problems?
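For reference, here is a minimal C sketch of the same single-producer, single-consumer queue; the names (queue, head, tail, CAPACITY) are ours, not from the slides. It deliberately uses plain loads and stores, so it is only correct if the hardware and the compiler preserve program order, which is exactly the assumption the slide questions.

    #define CAPACITY 16

    static int queue[CAPACITY];
    static volatile unsigned head = 0, tail = 0;  /* shared between the two threads */

    /* Producer: store the item first, then publish it by bumping tail. */
    void produce(int x) {
        while (tail - head == CAPACITY) { }   /* spin: queue full */
        queue[tail % CAPACITY] = x;
        tail = tail + 1;  /* must not become visible before the store of x */
    }

    /* Consumer: wait until tail moves past head, then consume one item. */
    int consume(void) {
        while (head == tail) { }              /* spin: queue empty */
        int r = queue[head % CAPACITY];
        head = head + 1;
        return r;
    }

On an out-of-order machine (or after compiler reordering) the update of tail can become visible before the data store, so the consumer may read garbage; the rest of the lecture develops the machinery that makes this correct.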

Sequential Consistency: A Memory Model

[Figure: several processors P sharing a single memory M.]

"A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program." Leslie Lamport

Sequential consistency = an arbitrary order-preserving interleaving of the memory references of sequential programs.

Sequential Consistency

Sequential concurrent tasks T1, T2; shared variables X, Y (initially X = 0, Y = 10):

    T1:                          T2:
    Store (X), 1    (X = 1)      Load R1, (Y)
    Store (Y), 11   (Y = 11)     Store (Y'), R1    (Y' = Y)
                                 Load R2, (X)
                                 Store (X'), R2    (X' = X)

What are the legitimate answers for X' and Y'? Is (X', Y') ∈ {(1,11), (0,10), (1,10), (0,11)}? If Y' is 11, then X' cannot be 0.

Sequential Consistency

Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies. What are the additional constraints in our example?

    T1:                          T2:
    Store (X), 1    (X = 1)      Load R1, (Y)
    Store (Y), 11   (Y = 11)     Store (Y'), R1    (Y' = Y)
                                 Load R2, (X)
                                 Store (X'), R2    (X' = X)

The additional SC requirements: T1's two stores (to different addresses) must appear in program order, and T2's load of Y must appear before its load of X, even though no uniprocessor dependence orders them. Does (can) a system with caches or out-of-order execution capability provide a sequentially consistent view of the memory? More on this later.

Multiple Consumer Example

[Figure: one producer with Rtail; two consumers, each with Rhead, Rtail, and R.]

Producer posting item x:
        Load Rtail, (tail)
        Store (Rtail), x
        Rtail = Rtail + 1
        Store (tail), Rtail

Consumer:
        Load Rhead, (head)
  spin: Load Rtail, (tail)
        if Rhead == Rtail goto spin
        Load R, (Rhead)
        Rhead = Rhead + 1
        Store (head), Rhead
        process(R)

The critical section (from loading head through storing head) needs to be executed atomically by one consumer at a time, which calls for locks. What is wrong with this code?

Locks or Semaphores (E. W. Dijkstra, 1965)

A semaphore is a non-negative integer, with the following operations: P(s): if s > 0, decrement s by 1, otherwise wait; V(s): increment s by 1 and wake up one of the waiting processes. P's and V's must be executed atomically, i.e., without interruptions or interleaved accesses to s by other processors.

Process i:
        P(s)
        <critical section>
        V(s)

The initial value of s determines the maximum number of processes in the critical section.
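A minimal spinning sketch of P and V with C11 atomics (the slides do not prescribe an implementation, and a real semaphore would block and wake sleepers instead of spinning; the struct and function names are ours):

    #include <stdatomic.h>

    struct semaphore { atomic_int s; };

    /* P(s): atomically decrement s if it is positive, otherwise keep waiting. */
    void sem_P(struct semaphore *sem) {
        for (;;) {
            int v = atomic_load(&sem->s);
            if (v > 0 && atomic_compare_exchange_weak(&sem->s, &v, v - 1))
                return;
            /* otherwise spin; a blocking implementation would sleep here */
        }
    }

    /* V(s): increment s; a blocking implementation would also wake one waiter. */
    void sem_V(struct semaphore *sem) {
        atomic_fetch_add(&sem->s, 1);
    }

The compare-exchange is what makes the test and the decrement one atomic step; two plain instructions ("if s > 0" then "s = s - 1") could interleave with another processor's P.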

Implementation of Semaphores

Semaphores (mutual exclusion) can be implemented using ordinary Load and Store instructions in the Sequential Consistency memory model. However, protocols for mutual exclusion built that way are difficult to design... A simpler solution: atomic read-modify-write instructions. Examples, where m is a memory location and R is a register:

    Test&Set (m), R:        R ← M[m]; if R == 0 then M[m] ← 1;
    Fetch&Add (m), RV, R:   R ← M[m]; M[m] ← R + RV;
    Swap (m), R:            Rt ← M[m]; M[m] ← R; R ← Rt;
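C11 exposes Test&Set directly as atomic_flag_test_and_set, so a spinlock built on it is a faithful sketch of the primitive (the lock/unlock names are ours):

    #include <stdatomic.h>

    static atomic_flag mutex = ATOMIC_FLAG_INIT;

    void lock(void) {
        /* Test&Set returns the old value: loop until we are the one that
           changed the flag from clear (0) to set (1). */
        while (atomic_flag_test_and_set(&mutex))
            ;  /* spin */
    }

    void unlock(void) {
        atomic_flag_clear(&mutex);  /* release: the V operation */
    }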

Multiple Consumers Example, using the Test&Set Instruction

     P: Test&Set (mutex), Rtemp
        if (Rtemp != 0) goto P
        Load Rhead, (head)            ; critical section begins
  spin: Load Rtail, (tail)
        if Rhead == Rtail goto spin
        Load R, (Rhead)
        Rhead = Rhead + 1
        Store (head), Rhead           ; critical section ends
     V: Store (mutex), 0
        process(R)

Other atomic read-modify-write instructions (Swap, Fetch&Add, etc.) can also implement P's and V's. What if the process stops or is swapped out while in the critical section?

Nonblocking Synchronization

    Compare&Swap (m), Rt, Rs:
        if Rt == M[m]
        then M[m] ← Rs; Rs ← Rt; status ← success;
        else status ← fail;

status is an implicit argument.

   try: Load Rhead, (head)
  spin: Load Rtail, (tail)
        if Rhead == Rtail goto spin
        Load R, (Rhead)
        Rnewhead = Rhead + 1
        Compare&Swap (head), Rhead, Rnewhead
        if (status == fail) goto try
        process(R)
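The same lock-free consume can be sketched with C11's compare-exchange, reusing the hypothetical ring buffer from earlier; only the head-advancing CAS makes the dequeue atomic:

    #include <stdatomic.h>

    #define CAPACITY 16
    extern int queue[CAPACITY];        /* assumed shared ring buffer */
    extern atomic_uint head, tail;

    /* Lock-free consume: read the item, then claim it by advancing head
       with a CAS; if the CAS fails, another consumer won the race, so retry. */
    int consume(void) {
        for (;;) {
            unsigned h = atomic_load(&head);
            if (h == atomic_load(&tail))
                continue;                        /* spin: queue empty */
            int r = queue[h % CAPACITY];
            if (atomic_compare_exchange_strong(&head, &h, h + 1))
                return r;                        /* claimed: the item is ours */
            /* CAS failed: head moved under us; goto try */
        }
    }

Nothing blocks: a consumer descheduled inside the loop cannot prevent the others from making progress, which answers the "swapped out in the critical section" worry from the previous slide.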

Load-reserve & Store-conditional

Special register(s) hold a reservation flag and address, and the outcome of the store-conditional:

    Load-reserve R, (m):
        <flag, adr> ← <1, m>; R ← M[m];

    Store-conditional (m), R:
        if <flag, adr> == <1, m>
        then cancel other procs' reservations on m;
             M[m] ← R; status ← succeed;
        else status ← fail;

   try: Load-reserve Rhead, (head)
  spin: Load Rtail, (tail)
        if Rhead == Rtail goto spin
        Load R, (Rhead)
        Rhead = Rhead + 1
        Store-conditional (head), Rhead
        if (status == fail) goto try
        process(R)
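C11 has no direct load-reserve/store-conditional, but atomic_compare_exchange_weak is allowed to fail spuriously precisely so compilers can map it to an LL/SC pair on ISAs such as RISC-V and ARM. A hedged sketch of the head update in that style:

    #include <stdatomic.h>

    extern atomic_uint head;   /* the shared head index from the example */

    /* Advance head by one. The weak CAS may compile to load-reserve /
       store-conditional, so it can fail spuriously and must sit in a loop. */
    void advance_head(void) {
        unsigned h = atomic_load(&head);
        while (!atomic_compare_exchange_weak(&head, &h, h + 1))
            ;   /* on failure, h has been reloaded with the current value */
    }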

Performance of Locks

Blocking atomic read-modify-write instructions (e.g., Test&Set, Fetch&Add, Swap) vs. non-blocking atomic read-modify-write instructions (e.g., Compare&Swap, Load-reserve/Store-conditional) vs. protocols based on ordinary Loads and Stores. Performance depends on several interacting factors: degree of contention, caches, and out-of-order execution of Loads and Stores. More on this later...

Issues in Implementing Sequential Consistency

[Figure: processors P sharing a memory M.]

Implementation of SC is complicated by two issues. First, out-of-order execution capability: which reorderings are safe under uniprocessor dependencies alone?

    Load(a); Load(b)     yes
    Load(a); Store(b)    yes, if a ≠ b
    Store(a); Load(b)    yes, if a ≠ b
    Store(a); Store(b)   yes, if a ≠ b

Second, caches: caches can prevent the effect of a store from being seen by other processors. No common commercial architecture has a sequentially consistent memory model!

Memory Fences

Memory fences are instructions to sequentialize memory accesses. Processors with relaxed or weak memory models (i.e., ones that permit Loads and Stores to different addresses to be reordered) need to provide memory fence instructions to force the serialization of memory accesses. Examples of processors with relaxed memory models:

    Sparc V8 (TSO, PSO):  Membar
    Sparc V9 (RMO):       Membar #LoadLoad, Membar #LoadStore, Membar #StoreLoad, Membar #StoreStore
    PowerPC (WO):         Sync, EIEIO

Memory fences are expensive operations; however, one pays the cost of serialization only when it is required.

Using Memory Fences

[Figure: the producer-consumer queue with tail and head pointers, as before.]

Producer posting item x:
        Load Rtail, (tail)
        Store (Rtail), x
        Membar_SS                 ; ensures the tail pointer is not updated before x has been stored
        Rtail = Rtail + 1
        Store (tail), Rtail

Consumer:
        Load Rhead, (head)
  spin: Load Rtail, (tail)
        if Rhead == Rtail goto spin
        Membar_LL                 ; ensures R is not loaded before x has been stored
        Load R, (Rhead)
        Rhead = Rhead + 1
        Store (head), Rhead
        process(R)
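In C11 the two fences correspond to atomic_thread_fence: a release fence in the producer plays the Membar_SS role, and an acquire fence in the consumer plays the Membar_LL role. A sketch using the hypothetical ring buffer from earlier (the full-queue check is omitted, as in the slide):

    #include <stdatomic.h>

    #define CAPACITY 16
    extern int queue[CAPACITY];
    extern atomic_uint head, tail;

    void produce(int x) {
        unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
        queue[t % CAPACITY] = x;
        atomic_thread_fence(memory_order_release);   /* Membar_SS: x before tail */
        atomic_store_explicit(&tail, t + 1, memory_order_relaxed);
    }

    int consume(void) {
        unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
        while (h == atomic_load_explicit(&tail, memory_order_relaxed))
            ;                                        /* spin: queue empty */
        atomic_thread_fence(memory_order_acquire);   /* Membar_LL: tail before data */
        int r = queue[h % CAPACITY];
        atomic_store_explicit(&head, h + 1, memory_order_relaxed);
        return r;
    }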

Mutual Exclusion Using Load/Store

A protocol based on two shared variables, c1 and c2. Initially both c1 and c2 are 0 (not busy).

    Process 1                        Process 2
    ...                              ...
    c1 = 1;                          c2 = 1;
    L: if c2 == 1 then goto L        L: if c1 == 1 then goto L
    <critical section>               <critical section>
    c1 = 0;                          c2 = 0;

What is wrong? Deadlock! If both processes set their flags before either tests the other's, both spin at L forever.

Mutual Exclusion: Second Attempt

To avoid deadlock, let a process give up its reservation (i.e., Process 1 sets c1 to 0) while waiting:

    Process 1                                  Process 2
    ...                                        ...
    L: c1 = 1;                                 L: c2 = 1;
       if c2 == 1 then { c1 = 0; goto L }         if c1 == 1 then { c2 = 0; goto L }
    <critical section>                         <critical section>
    c1 = 0;                                    c2 = 0;

Deadlock is not possible, but with low probability a livelock may occur: an unlucky process may never get to enter the critical section. Starvation.

A Protocol for Mutual Exclusion (T. Dekker, 1966)

A protocol based on 3 shared variables: c1, c2, and turn. Initially both c1 and c2 are 0 (not busy).

    Process 1                                    Process 2
    ...                                          ...
    c1 = 1;                                      c2 = 1;
    turn = 1;                                    turn = 2;
    L: if c2 == 1 & turn == 1 then goto L        L: if c1 == 1 & turn == 2 then goto L
    <critical section>                           <critical section>
    c1 = 0;                                      c2 = 0;

turn = i ensures that only process i can wait; variables c1 and c2 ensure mutual exclusion. The solution for n processes was given by Dijkstra and is quite tricky!
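The protocol depends on every access to c1, c2, and turn being sequentially consistent; C11's default (seq_cst) atomics provide that, so a direct transcription for Process 1 looks like this sketch (Process 2 is symmetric, with the roles of c1/c2 swapped and turn == 2; the function names are ours):

    #include <stdatomic.h>

    static atomic_int c1, c2, turn;   /* all initially 0 */

    void process1_enter(void) {
        atomic_store(&c1, 1);
        atomic_store(&turn, 1);
        /* wait while the other process is interested and it is our turn to yield */
        while (atomic_load(&c2) == 1 && atomic_load(&turn) == 1)
            ;   /* spin at L */
    }

    void process1_exit(void) {
        atomic_store(&c1, 0);
    }

With plain non-atomic ints, the compiler or a weakly ordered processor could reorder the stores and loads and break mutual exclusion, which is the point of the memory-model discussion above.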

N-process Mutual Exclusion: Lamport's Bakery Algorithm

Process i, entry code:

    choosing[i] = 1;
    num[i] = max(num[0], ..., num[N-1]) + 1;
    choosing[i] = 0;
    for (j = 0; j < N; j++) {
        while (choosing[j]);
        while (num[j] && ((num[j] < num[i]) ||
                          (num[j] == num[i] && j < i)));
    }

Exit code:

    num[i] = 0;

Initially num[j] = 0 for all j.

Memory Coherence in SMPs

[Figure: CPU-1 and CPU-2 each cache A = 100; memory also holds A = 100; all sit on the CPU-memory bus.]

Suppose CPU-1 updates A to 200. With a write-back cache, memory and cache-2 have stale values; with a write-through cache, cache-2 has a stale value. Do these stale values matter? What is the view of shared memory for programming?

Write-back Caches & SC

T1 on CPU-1 runs ST X, 1; ST Y, 11. T2 on CPU-2 runs LD Y, R1; ST Y', R1; LD X, R2; ST X', R2. With write-back caches (initially X = 0, Y = 10 in memory), one possible history:

    T1 executed:                cache-1: X=1, Y=11   memory: X=0, Y=10              cache-2: empty
    cache-1 writes back Y:      cache-1: X=1, Y=11   memory: X=0, Y=11              cache-2: empty
    T2 executed:                cache-1: X=1, Y=11   memory: X=0, Y=11              cache-2: Y=11, Y'=11, X=0, X'=0
    cache-1 writes back X:      cache-1: X=1, Y=11   memory: X=1, Y=11              cache-2: Y=11, Y'=11, X=0, X'=0
    cache-2 writes back X', Y':                      memory: X=1, Y=11, X'=0, Y'=11

T2 sees Y = 11 but X = 0, an outcome sequential consistency forbids.

Write-through Caches & SC

Same programs, write-through caches. Initially cache-1 holds X=0, Y=10; cache-2 holds X=0; memory holds X=0, Y=10.

    T1 executed:   cache-1: X=1, Y=11   memory: X=1, Y=11                  cache-2: X=0 (stale)
    T2 executed:   cache-1: X=1, Y=11   memory: X=1, Y=11, X'=0, Y'=11     cache-2: Y=11, Y'=11, X=0, X'=0

Write-through caches don't preserve sequential consistency either.

Cache Coherence vs. Memory Consistency

A cache coherence protocol ensures that all writes by one processor are eventually visible to other processors for a single memory address, i.e., updates are not lost. A memory consistency model gives the rules on when a write by one processor can be observed by a read on another, across different addresses; equivalently, it defines what values a load can see. A cache coherence protocol is not enough to ensure sequential consistency; but if a system is sequentially consistent, its caches must be coherent. The combination of the cache coherence protocol and the processor's memory reorder buffer implements a given machine's memory consistency model.

Maintaining Cache Coherence

Hardware support is required such that only one processor at a time has write permission for a location, and no processor can load a stale copy of the location after a write. This is the job of cache coherence protocols.

Warmup: Parallel I/O

[Figure: processor and cache on a memory bus with physical memory; a DMA engine connects a disk to the same bus, driving address (A), data (D), and R/W lines.]

Either the cache or the DMA engine can be the bus master and effect transfers. Page transfers occur while the processor is running. (DMA stands for Direct Memory Access: the I/O device can read and write memory autonomously from the CPU.)

Problems with Parallel I/O

[Figure: cached portions of a page sit in the cache while DMA transfers move the page between memory and disk.]

Memory → Disk: physical memory may be stale if the cache copy is dirty, so the DMA engine can write stale data to disk. Disk → Memory: the cache may hold stale data and not see the memory writes performed by the DMA engine.

Snoopy Cache (Goodman 1983)

Idea: have the cache watch (or snoop upon) DMA transfers, and then do the right thing. Snoopy cache tags are dual-ported: one port drives the memory bus when the cache is bus master, and a snoopy read port is attached to the memory bus. [Figure: processor-side and snoop-side ports both reaching the tags-and-state array and the data lines.]

Shared Memory Multiprocessor

[Figure: processors M1, M2, M3, each behind a snoopy cache, plus a DMA engine and disks, all on the memory bus in front of physical memory.]

Use the snoopy mechanism to keep all processors' view of memory coherent.

Snoopy Cache Coherence Protocols

Write miss: the address is invalidated in all other caches before the write is performed. Read miss: if a dirty copy is found in some cache, a write-back is performed before the memory is read.

Cache State Transition Diagram: the MSI Protocol

Each cache line has state bits (M: Modified, S: Shared, I: Invalid) alongside its address tag. The transitions for a line in processor P1's cache:

    I → M:  P1 write miss (P1 gets the line from memory)
    I → S:  P1 read miss (P1 gets the line from memory)
    S → M:  P1 intent to write
    M → M:  P1 reads or writes (hits)
    S → S:  read by any processor
    M → S:  other processor reads (P1 writes back)
    M → I:  other processor intent to write (P1 writes back)
    S → I:  other processor intent to write
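A compact way to capture the diagram is a transition function over (state, event). This sketch is ours; the event names follow the slide's arc labels:

    typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;

    typedef enum {
        PR_READ,     /* this processor reads */
        PR_WRITE,    /* this processor writes (intent to write on the bus) */
        BUS_READ,    /* another processor reads the line */
        BUS_WRITE    /* another processor signals intent to write */
    } msi_event_t;

    /* Next state for one line in this cache; *writeback is set when the
       line's dirty data must be written back / supplied. */
    msi_state_t msi_next(msi_state_t st, msi_event_t ev, int *writeback) {
        *writeback = 0;
        switch (st) {
        case MSI_I:
            if (ev == PR_READ)  return MSI_S;   /* read miss: fetch the line */
            if (ev == PR_WRITE) return MSI_M;   /* write miss: fetch and own */
            return MSI_I;                       /* bus events leave I alone */
        case MSI_S:
            if (ev == PR_WRITE)  return MSI_M;  /* upgrade: broadcast intent to write */
            if (ev == BUS_WRITE) return MSI_I;  /* another writer invalidates us */
            return MSI_S;                       /* reads keep the line shared */
        case MSI_M:
            if (ev == BUS_READ)  { *writeback = 1; return MSI_S; }
            if (ev == BUS_WRITE) { *writeback = 1; return MSI_I; }
            return MSI_M;                       /* own reads and writes hit */
        }
        return st;
    }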

Two Processor Example (Reading and Writing the Same Cache Line)

Access sequence: P1 reads, P1 writes, P2 reads, P2 writes, P1 reads, P1 writes, P2 writes, P1 writes. Tracing the MSI states:

    P1 reads:   P1: I → S (read miss)
    P1 writes:  P1: S → M (intent to write)
    P2 reads:   P2: I → S (read miss); P1: M → S (writes back)
    P2 writes:  P2: S → M (intent to write); P1: S → I
    P1 reads:   P1: I → S (read miss); P2: M → S (writes back)
    P1 writes:  P1: S → M (intent to write); P2: S → I
    P2 writes:  P2: I → M (write miss); P1: M → I (writes back)
    P1 writes:  P1: I → M (write miss); P2: M → I (writes back)

Observation

If a line is in the M state, then no other cache can have a copy of the line! Memory stays coherent: multiple differing copies cannot exist.

MESI: An Enhanced MSI Protocol

MESI adds an Exclusive state for increased performance on private data. Each cache line has state bits (M: Modified Exclusive; E: Exclusive but unmodified; S: Shared; I: Invalid) alongside its address tag. A read miss that no other cache shares fills the line in E rather than S, so a later write by P1 moves E → M silently, with no bus transaction. The transitions for a line in processor P1's cache:

    I → M:  P1 write miss
    I → E:  P1 read miss, not shared
    I → S:  P1 read miss, shared
    E → M:  P1 write (no bus traffic)
    E → S:  other processor reads
    E → I:  other processor intent to write
    M → M:  P1 write or read (hits)
    M → S:  other processor reads (P1 writes back)
    M → I:  other processor intent to write (P1 writes back)
    S → M:  P1 intent to write
    S → S:  read by any processor
    S → I:  other processor intent to write

Optimized Snoop with Level-2 Caches

[Figure: four CPUs, each with an L1 and an L2; the snooper attaches to the L2.]

Processors often have two-level caches: a small L1 and a large L2, usually both on chip now. Inclusion property: entries in L1 must also be in L2, so an invalidation in L2 forces an invalidation in L1. Snooping on L2 does not affect CPU-L1 bandwidth. What problem could occur?

Intervention

[Figure: CPU-1's cache holds A = 200 (modified); CPU-2's cache misses on A; memory still holds the stale value A = 100.]

When a read miss for A occurs in cache-2, a read request for A is placed on the bus. Cache-1 needs to supply the data and change its state to shared. But the memory may respond to the request as well! Does memory know it has stale data? Cache-1 needs to intervene through the memory controller to supply the correct data to cache-2.

False Sharing

[Cache block layout: state, block address, data0, data1, ..., dataN.]

A cache block contains more than one word, and cache coherence is done at the block level, not the word level. Suppose M1 writes word i and M2 writes word k, and both words have the same block address. What can happen?
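A common way this shows up in software: two threads updating adjacent counters that land in one cache block, so every increment invalidates the other thread's copy. The sketch below is illustrative; the 64-byte block size and the padding idiom are assumptions, not from the slides:

    #include <pthread.h>

    /* Both counters in one cache block: writes to a and b ping-pong the block. */
    struct counters_bad  { long a; long b; };

    /* Pad each counter out to its own (assumed 64-byte) block to avoid it. */
    struct counters_good {
        long a; char pad_a[64 - sizeof(long)];
        long b; char pad_b[64 - sizeof(long)];
    };

    static struct counters_good g;

    static void *bump_a(void *arg) { for (long i = 0; i < 100000000L; i++) g.a++; return arg; }
    static void *bump_b(void *arg) { for (long i = 0; i < 100000000L; i++) g.b++; return arg; }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, 0, bump_a, 0);
        pthread_create(&t2, 0, bump_b, 0);
        pthread_join(t1, 0);
        pthread_join(t2, 0);
        return 0;
    }

With counters_bad the two loops contend for one block even though they share no data; with counters_good they run out of separate blocks.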

Synchronization and Caches: Performance Issues

Processors 1, 2, and 3 all run the same lock code:

        R ← 1
    L:  swap (mutex), R;
        if <R> then goto L;
        <critical section>
        M[mutex] ← 0;

[Figure: each processor's cache, with the mutex = 1 line on the CPU-memory bus.]

Cache-coherence protocols will cause the mutex line to ping-pong between P1's and P2's caches. Ping-ponging can be reduced by first reading the mutex location (non-atomically) and executing a swap only if it is found to be zero.
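That read-before-swap refinement is the test-and-test-and-set idiom: spinning on an ordinary load hits in the local cache with the line in shared state, and only the final atomic exchange takes the line exclusive. A C11 sketch (names ours):

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool mutex;   /* false = free */

    void tatas_lock(void) {
        for (;;) {
            /* test: spin on a plain read, a cache hit while the lock is held */
            while (atomic_load(&mutex))
                ;
            /* the lock looked free: now do the expensive atomic swap */
            if (!atomic_exchange(&mutex, true))
                return;          /* swapped false -> true: lock acquired */
        }
    }

    void tatas_unlock(void) {
        atomic_store(&mutex, false);
    }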

Load-reserve & Store-conditional

Special register(s) hold the reservation flag and address, and the outcome of the store-conditional:

    Load-reserve R, (a):
        <flag, adr> ← <1, a>; R ← M[a];

    Store-conditional (a), R:
        if <flag, adr> == <1, a>
        then cancel other procs' reservations on a;
             M[a] ← R; status ← succeed;
        else status ← fail;

If the snooper sees a store transaction to the address in the reserve register, the reserve bit is cleared. Several processors may hold a reservation on a simultaneously. These instructions are like ordinary loads and stores with respect to bus traffic. The reservation can be implemented using the cache hit/miss machinery, with no additional hardware (problems?).

Out-of-Order Loads/Stores & CC

[Figure: CPU with load/store buffers in front of a cache (line states I/S/E); the snooper handles Wb-req, Inv-req, and Inv-rep on the bus side; the cache issues S-req/E-req, receives S-rep/E-rep, and sends pushouts (Wb-rep) to memory.]

Blocking caches: one request at a time; combined with CC, this yields SC. Non-blocking caches: multiple requests (to different addresses) concurrently; combined with CC, this yields relaxed memory models. CC ensures that all processors observe the same order of loads and stores to any one address.

Performance of Symmetric Shared-Memory Multiprocessors

Cache performance is a combination of: 1. uniprocessor cache miss traffic, and 2. traffic caused by communication, which results in invalidations and subsequent cache misses. This adds a 4th C, the coherence miss (sometimes called a communication miss), joining Compulsory, Capacity, and Conflict.

Coherency Misses

1. True sharing misses arise from the communication of data through the cache coherence mechanism: invalidations due to the first write to a shared block, and reads by another CPU of a modified block held in a different cache. The miss would still occur if the block size were one word.

2. False sharing misses occur when a block is invalidated because some word in the block, other than the one being read, is written into. The invalidation does not cause a new value to be communicated, only an extra cache miss: the block is shared, but no word in the block is actually shared. The miss would not occur if the block size were one word.

Example: True vs. False Sharing vs. Hit?

Assume x1 and x2 are in the same cache block, and P1 and P2 have both read x1 and x2 before.

    Time  P1        P2        Classification
    1     Write x1            True miss (invalidates x1 in P2)
    2               Read x2   False miss (x1 irrelevant to P2)
    3     Write x1            False miss (x1 irrelevant to P2)
    4               Write x2  False miss (x1 irrelevant to P2)
    5     Read x2             True miss (invalidates x2 in P1)

A Cache-Coherent System Must:

Provide a set of states, a state transition diagram, and actions, and manage the coherence protocol: (0) determine when to invoke the coherence protocol; (a) find info about the state of the address in other caches, to decide whether it needs to communicate with other cached copies; (b) locate the other copies; (c) communicate with those copies (invalidate/update). (0) is done the same way on all systems: the state of the line is maintained in the cache, and the protocol is invoked if an access fault occurs on the line. Different approaches are distinguished by how they do (a) to (c).

Bus-based Coherence

All of (a), (b), and (c) are done through broadcast on the bus: the faulting processor sends out a search, and the others respond to the search probe and take the necessary action. This could be done in a scalable network too: broadcast to all processors, and let them respond. Conceptually simple, but broadcast doesn't scale with the number of processors P: on a bus, bus bandwidth doesn't scale; on a scalable network, every fault leads to at least P network transactions. Scalable coherence can keep the same cache states and state transition diagram, with different mechanisms for managing the protocol.

Scalable Approach: Directories

Every memory block has associated directory information that keeps track of copies of cached blocks and their states. On a miss, find the directory entry, look it up, and communicate only with the nodes that have copies, if necessary. In scalable networks, communication with the directory and the copies is through network transactions. There are many alternatives for organizing directory information.

Basic Operation of a Directory

With k processors, each cache block in memory has k presence bits and 1 dirty bit; each cache block in a cache has 1 valid bit and 1 dirty (owner) bit.

Read from main memory by processor i:
    If the dirty bit is OFF: read from main memory; turn p[i] ON.
    If the dirty bit is ON: recall the line from the dirty processor (downgrading its cache state to shared); update memory; turn the dirty bit OFF; turn p[i] ON; supply the recalled data to i.

Write to main memory by processor i:
    If the dirty bit is OFF: send invalidations to all caches that have the block; turn the dirty bit ON; supply the data to i; turn p[i] ON; ...
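A sketch of the read-miss handler in C, transcribing the slide's rules directly; the directory-entry layout and the recall/update/supply helpers are hypothetical stand-ins for real network transactions:

    #include <stdbool.h>

    #define K 8                          /* number of processors (assumed) */

    typedef struct {
        bool presence[K];                /* k presence bits */
        bool dirty;                      /* 1 dirty bit */
        int  owner;                      /* meaningful when dirty: owning cache */
    } dir_entry_t;

    /* Hypothetical transaction helpers. */
    void recall_line(int owner, int block);   /* owner downgrades to shared */
    void update_memory(int block);
    void supply_data(int proc, int block);

    /* Read request from processor i for the given block. */
    void dir_read(dir_entry_t *e, int block, int i) {
        if (!e->dirty) {
            supply_data(i, block);        /* read from main memory */
            e->presence[i] = true;        /* turn p[i] ON */
        } else {
            recall_line(e->owner, block); /* recall from the dirty processor */
            update_memory(block);
            e->dirty = false;             /* turn dirty bit OFF */
            e->presence[i] = true;        /* turn p[i] ON */
            supply_data(i, block);        /* supply the recalled data to i */
        }
    }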

Directory Cache Protocol

[Figure: several CPU-cache nodes and several directory-controller-plus-DRAM-bank nodes connected by an interconnection network.]

Assumptions: reliable network, FIFO message delivery between any given source-destination pair.

Cache States

For each cache line, there are 4 possible states:

C-invalid (= Nothing): the accessed data is not resident in the cache.
C-shared (= Sh): the accessed data is resident in the cache, and possibly also cached at other sites. The data in memory is valid.
C-modified (= Ex): the accessed data is exclusively resident in this cache, and has been modified. Memory does not have the most up-to-date data.
C-transient (= Pending): the accessed data is in a transient state (for example, the site has just issued a protocol request but has not yet received the corresponding protocol reply).

Home Directory States

For each memory block, there are 4 possible states:

R(dir): the memory block is shared by the sites specified in dir (dir is a set of sites). The data in memory is valid in this state. If dir is empty (i.e., dir = ε), the memory block is not cached by any site.
W(id): the memory block is exclusively cached at site id, and has been modified at that site. Memory does not have the most up-to-date data.
TR(dir): the memory block is in a transient state, waiting for the acknowledgements to the invalidation requests that the home site has issued.
TW(id): the memory block is in a transient state, waiting for a block exclusively cached at site id (i.e., in C-modified state) to make the memory block at the home site up-to-date.

Protocol Messages

There are 10 different protocol messages:

    Category                      Messages
    Cache-to-memory requests      ShReq, ExReq
    Memory-to-cache requests      WbReq, InvReq, FlushReq
    Cache-to-memory responses     WbRep(v), InvRep, FlushRep(v)
    Memory-to-cache responses     ShRep(v), ExRep(v)

Coherence Rules

Writes eventually become visible to all processors, and writes to the same location are serialized.

Snoopy Coherence Protocols

The bus provides the serialization point: broadcast, totally ordered. Each cache controller snoops all bus transactions, updates the state of its cache in response to processor and snoop events, and generates bus transactions. A snoopy protocol is an FSM: a state-transition diagram plus actions. Writes are handled by write-invalidate or write-update.

Directory Taxonomy & Scalability

Duplicate tags; full-map; sparse; full bit vectors; coarse-grain bit vectors; limited pointers; in-cache; hierarchical sparse.

Directory-Based Coherence

Route all coherence transactions through a directory that tracks the contents of the private caches; no broadcasts are needed. The directory also serves as the ordering point for conflicting requests, so unordered networks can be used.

Coherence & Consistency

Shared memory systems have multiple private caches for performance reasons but need to provide the illusion of a single shared memory. Intuition: a read should return the most recently written value; but what does "most recent" mean? Formally, coherence specifies what values a read can return, and concerns reads and writes to a single memory location; consistency specifies when writes become visible to reads, and concerns reads and writes to multiple memory locations.

Acknowledgements

These slides contain material developed and copyrighted by: Arvind (MIT), Krste Asanovic (MIT/UCB), Joel Emer (Intel/MIT), James Hoe (CMU), John Kubiatowicz (UCB), and David Patterson (UCB). MIT material derived from course 6.823; UCB material derived from course CS252.