CS 252 Graduate Computer Architecture. Lecture 11: Multiprocessors-II


CS 252 Graduate Computer Architecture
Lecture 11: Multiprocessors-II
Krste Asanovic
Electrical Engineering and Computer Sciences, University of California, Berkeley
10/16/2007

Recap: Sequential Consistency, A Memory Model
[Figure: several processors (P) sharing a single memory (M)]
"A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program." (Leslie Lamport)
Sequential Consistency = arbitrary order-preserving interleaving of memory references of sequential programs

Recap: Sequential Consistency
Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies. What are these in our example?

T1:
  Store (X), 1       (X = 1)
  Store (Y), 11      (Y = 11)
T2:
  Load R1, (Y)
  Store (Y'), R1     (Y' = Y)
  Load R2, (X)
  Store (X'), R2     (X' = X)

[Figure: arrows distinguish the uniprocessor dependencies from the additional SC requirements]

Recap: Mutual Exclusion and Locks
Want to guarantee only one process is active in a critical section
  Blocking atomic read-modify-write instructions, e.g., Test&Set, Fetch&Add, Swap
  vs
  Non-blocking atomic read-modify-write instructions, e.g., Compare&Swap, Load-reserve/Store-conditional
  vs
  Protocols based on ordinary Loads and Stores
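The question posed in the sequential consistency recap can be explored by brute force. The sketch below is only an illustration, not part of the lecture: it enumerates every order-preserving interleaving of T1 and T2 and collects the final values of the two locations that T2 copies into. The names `X2` and `Y2` for those locations, and the initial values X = 0 and Y = 10, are assumptions made for this example.

```python
from itertools import combinations

# Operations are (kind, destination, source). X2 and Y2 are hypothetical
# names for the two locations T2 copies X and Y into.
T1 = [("store", "X", 1), ("store", "Y", 11)]
T2 = [("load", "R1", "Y"), ("store_reg", "Y2", "R1"),
      ("load", "R2", "X"), ("store_reg", "X2", "R2")]


def run(schedule):
    """Execute one interleaving against a single shared memory."""
    mem = {"X": 0, "Y": 10, "X2": None, "Y2": None}  # assumed initial values
    regs = {}
    for kind, dst, src in schedule:
        if kind == "store":        # store an immediate value
            mem[dst] = src
        elif kind == "load":       # load memory into a register
            regs[dst] = mem[src]
        else:                      # store a register to memory
            mem[dst] = regs[src]
    return mem["X2"], mem["Y2"]


def sc_outcomes():
    """Collect (X2, Y2) over all interleavings that keep program order."""
    n1, n2 = len(T1), len(T2)
    outcomes = set()
    for slots in combinations(range(n1 + n2), n1):
        it1, it2 = iter(T1), iter(T2)
        schedule = [next(it1) if i in slots else next(it2)
                    for i in range(n1 + n2)]
        outcomes.add(run(schedule))
    return outcomes
```

Under these assumptions, the outcome in which T2 observes the new Y (11) but the old X (0) never appears. That is the extra ordering SC imposes beyond each thread's own dependencies.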

Issues in Implementing Sequential Consistency
[Figure: several processors (P) sharing a single memory (M)]
Implementation of SC is complicated by two issues:

Out-of-order execution capability (may the two accesses be reordered?)
  Load(a); Load(b)       yes
  Load(a); Store(b)      yes if a ≠ b
  Store(a); Load(b)      yes if a ≠ b
  Store(a); Store(b)     yes if a ≠ b

Caches
  Caches can prevent the effect of a store from being seen by other processors

SC complications motivate architects to consider weak or relaxed memory models

Memory Fences: Instructions to sequentialize memory accesses
Processors with relaxed or weak memory models (i.e., ones that permit Loads and Stores to different addresses to be reordered) need to provide memory fence instructions to force the serialization of memory accesses
Examples of processors with relaxed memory models:
  Sparc V8 (TSO, PSO): Membar
  Sparc V9 (RMO): Membar #LoadLoad, Membar #LoadStore, Membar #StoreLoad, Membar #StoreStore
  PowerPC (WO): Sync, EIEIO
Memory fences are expensive operations; however, one pays the cost of serialization only when it is required
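To see why reordering a Store followed by a Load to a different address matters, the following sketch compares the outcomes of a small two-thread test with and without that reordering. It is an illustrative model, not a description of any real processor: each thread sets its own flag and then reads the other's, and the flag names `c1`, `c2` and registers `r1`, `r2` are chosen for this example.

```python
from itertools import combinations, product


def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]


def outcomes(allow_store_load_reordering):
    """Return the set of (r1, r2) values the two threads can observe."""
    t1 = [("store", "c1", None), ("load", "c2", "r1")]
    t2 = [("store", "c2", None), ("load", "c1", "r2")]

    def orders(ops):
        # The store and the load touch different addresses, so a relaxed
        # model may also execute them in the opposite order.
        return [ops, ops[::-1]] if allow_store_load_reordering else [ops]

    results = set()
    for o1, o2 in product(orders(t1), orders(t2)):
        for schedule in interleavings(o1, o2):
            mem, regs = {"c1": 0, "c2": 0}, {}
            for kind, addr, reg in schedule:
                if kind == "store":
                    mem[addr] = 1
                else:
                    regs[reg] = mem[addr]
            results.add((regs["r1"], regs["r2"]))
    return results
```

With program order preserved, at least one thread always sees the other's flag. Once the reordering is allowed, both can read 0, which is the case a fence between the store and the load would rule out.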

Using Memory Fences
[Figure: a queue between Producer and Consumer with tail and head pointers; the producer uses register Rtail, the consumer uses Rtail, Rhead and R]

Producer posting Item x:
  Load Rtail, (tail)
  Store (Rtail), x
  MembarSS
  Rtail = Rtail + 1
  Store (tail), Rtail

Consumer:
  Load Rhead, (head)
spin:
  Load Rtail, (tail)
  if Rhead == Rtail goto spin
  MembarLL
  Load R, (Rhead)
  Rhead = Rhead + 1
  Store (head), Rhead
  process(R)

MembarSS ensures that the tail pointer is not updated before x has been stored
MembarLL ensures that R is not loaded before x has been stored

Data-Race Free Programs (a.k.a. Properly Synchronized Programs)
Process 1:
  ...
  Acquire(mutex);
  <critical section>
  Release(mutex);
Process 2:
  ...
  Acquire(mutex);
  <critical section>
  Release(mutex);

Synchronization variables (e.g. mutex) are disjoint from data variables
Accesses to writable shared data variables are protected in critical regions
  ⇒ no data races except for locks
(Formal definition is elusive)
In general, it cannot be proven whether a program is data-race free.
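The role of the two fences in the producer-consumer code can be checked with a small exhaustive model. This is a simplified sketch under stated assumptions: the queue is reduced to one slot, the producer performs only "store item, store tail", the consumer only "load tail, load slot", and a missing fence is modelled as permission to swap the two accesses. The function and variable names are hypothetical.

```python
from itertools import combinations, product

PRODUCER = [("store", "slot", "x"), ("store", "tail", 1)]
CONSUMER = [("load", "tail"), ("load", "slot")]


def orders(ops, fenced):
    """Without a fence, the two accesses may also execute swapped."""
    return [ops] if fenced else [ops, ops[::-1]]


def stale_read_possible(membar_ss, membar_ll):
    """True if the consumer can see the new tail but miss the item."""
    for prod, cons in product(orders(PRODUCER, membar_ss),
                              orders(CONSUMER, membar_ll)):
        for slots in combinations(range(4), 2):
            ip, ic = iter(prod), iter(cons)
            mem, seen = {"slot": None, "tail": 0}, {}
            for i in range(4):
                op = next(ip) if i in slots else next(ic)
                if op[0] == "store":
                    mem[op[1]] = op[2]
                else:
                    seen[op[1]] = mem[op[1]]
            # Consumer leaves the spin loop (tail moved) yet reads no item.
            if seen["tail"] == 1 and seen["slot"] is None:
                return True
    return False
```

In this model, dropping either fence alone is enough to let the consumer read a slot that has not been written yet. With both in place, it cannot.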

Fences in Data-Race Free Programs
Process 1:
  ...
  Acquire(mutex);
  membar;
  <critical section>
  membar;
  Release(mutex);
Process 2:
  ...
  Acquire(mutex);
  membar;
  <critical section>
  membar;
  Release(mutex);

A relaxed memory model allows reordering of instructions by the compiler or the processor as long as the reordering is not done across a fence
The processor also should not speculate or prefetch across fences

Mutual Exclusion Using Load/Store
A protocol based on two shared variables c1 and c2. Initially, both c1 and c2 are 0 (not busy)

Process 1:
  ...
  c1=1;
  L: if c2=1 then go to L
  <critical section>
  c1=0;
Process 2:
  ...
  c2=1;
  L: if c1=1 then go to L
  <critical section>
  c2=0;

What is wrong?

Mutual Exclusion: second attempt
To avoid deadlock, let a process give up the reservation (i.e. Process 1 sets c1 to 0) while waiting.

Process 1:
  ...
  L: c1=1;
  if c2=1 then { c1=0; go to L }
  <critical section>
  c1=0
Process 2:
  ...
  L: c2=1;
  if c1=1 then { c2=0; go to L }
  <critical section>
  c2=0

What can go wrong now?
  Deadlock is not possible, but with a low probability a livelock may occur
  An unlucky process may never get to enter the critical section ⇒ starvation

A Protocol for Mutual Exclusion (T. Dekker, 1966)
A protocol based on 3 shared variables c1, c2 and turn. Initially, both c1 and c2 are 0 (not busy)

Process 1:
  ...
  c1=1;
  turn = 1;
  L: if c2=1 & turn=1 then go to L
  <critical section>
  c1=0;
Process 2:
  ...
  c2=1;
  turn = 2;
  L: if c1=1 & turn=2 then go to L
  <critical section>
  c2=0;

turn = i ensures that only process i can wait
variables c1 and c2 ensure mutual exclusion
Solution for n processes was given by Dijkstra and is quite tricky!
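The claims about the two-variable and three-variable protocols can be checked mechanically by exploring every reachable state under sequential consistency. The sketch below is a small hand-written state explorer, offered as an illustration only. It assumes each statement executes atomically (including the combined test of the flag and `turn`), and the program-counter encoding is invented for this example. The three-variable protocol as written closely resembles the formulation usually credited to Peterson.

```python
from collections import deque


def step(state, i, use_turn=True):
    """State after process i (1 or 2) executes its next statement.

    pc 0: c_i = 1        pc 1: turn = i
    pc 2: wait loop      pc 3: in critical section, then c_i = 0
    """
    pcs, c, turn = list(state[0]), list(state[1]), state[2]
    me, other = i - 1, 2 - i
    if pcs[me] == 0:
        c[me] = 1
        pcs[me] = 1 if use_turn else 2   # the first attempt has no turn
    elif pcs[me] == 1:
        turn = i
        pcs[me] = 2
    elif pcs[me] == 2:
        blocked = c[other] == 1 and (turn == i or not use_turn)
        if not blocked:
            pcs[me] = 3
    else:
        c[me] = 0
        pcs[me] = 0
    return (tuple(pcs), tuple(c), turn)


def explore(use_turn=True):
    """Breadth-first search over all interleavings of the two processes."""
    start = ((0, 0), (0, 0), 1)
    seen, queue = {start}, deque([start])
    while queue:
        s = queue.popleft()
        for i in (1, 2):
            t = step(s, i, use_turn)
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen


def both_in_critical_section(states):
    return any(pcs == (3, 3) for pcs, _, _ in states)


def deadlocked(states, use_turn=True):
    """A state is deadlocked if neither process can change it."""
    return any(step(s, 1, use_turn) == s and step(s, 2, use_turn) == s
               for s in states)
```

Calling `explore(use_turn=False)` models the first two-variable attempt and finds a reachable state in which both processes wait forever. With `turn`, neither a deadlock nor a state with both processes in the critical section is reachable.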

Analysis of Dekker's Algorithm
Scenario 1 and Scenario 2 both use the same code:

Process 1:
  ...
  c1=1;
  turn = 1;
  L: if c2=1 & turn=1 then go to L
  <critical section>
  c1=0;
Process 2:
  ...
  c2=1;
  turn = 2;
  L: if c1=1 & turn=2 then go to L
  <critical section>
  c2=0;

[Figure: arrows marking the two execution orderings, one per scenario]

N-process Mutual Exclusion: Lamport's Bakery Algorithm
Process i
Initially num[j] = 0, for all j

Entry Code
  choosing[i] = 1;
  num[i] = max(num[0], ..., num[N-1]) + 1;
  choosing[i] = 0;
  for(j = 0; j < N; j++) {
    while( choosing[j] );
    while( num[j] &&
           ( ( num[j] < num[i] ) ||
             ( num[j] == num[i] && j < i ) ) );
  }

Exit Code
  num[i] = 0;
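A minimal sketch of the bakery algorithm's entry and exit code using Python threads follows. It relies on an assumption worth stating loudly: standard CPython with its global interpreter lock behaves as a sequentially consistent machine for these list reads and writes, which is what the algorithm needs. The thread count, iteration count, and the deliberately racy counter update are all choices made for this example.

```python
import threading
import time

N = 3            # number of competing threads (arbitrary)
ITERS = 30       # critical-section entries per thread (arbitrary)
choosing = [0] * N
num = [0] * N
counter = 0      # shared data protected only by the bakery lock


def lock(i):
    """Entry code: take a ticket, then wait for all smaller tickets."""
    choosing[i] = 1
    num[i] = max(num) + 1
    choosing[i] = 0
    for j in range(N):
        while choosing[j]:
            time.sleep(0)          # yield so thread j can finish choosing
        # Tuple comparison encodes: num[j] < num[i], or equal and j < i.
        while num[j] and (num[j], j) < (num[i], i):
            time.sleep(0)


def unlock(i):
    """Exit code: give up the ticket."""
    num[i] = 0


def worker(i):
    global counter
    for _ in range(ITERS):
        lock(i)
        tmp = counter              # read ...
        time.sleep(0)              # ... invite a thread switch ...
        counter = tmp + 1          # ... write; unsafe without the lock
        unlock(i)


def run():
    """Run all workers and return the final counter value."""
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

If mutual exclusion holds, no increment is lost and `run()` returns `N * ITERS`. On hardware or runtimes with a relaxed memory model, the same code would additionally need fences.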

CS252 Administrivia
Project meetings next week (10/23-25), same schedule as before (M 1-3PM, Tu/Th 9:40-11AM)
  Schedule on website
  All in 645 Soda, 20 mins/group
  Hope to see:
    Project web site
    At least one initial result (some delta from "hello world")
    Grasp of related work
Midterm review

CS252 Midterm 1, Problem 1 Distribution (22 points total)
[Histogram: % of students vs. score]

CS252 Midterm 1, Problem 2 (19 points total)
[Histogram: % of students vs. score range]

CS252 Midterm 1, Problem 3 (19 points total)
[Histogram: % of students vs. score range]

CS252 Midterm 1, Problem 4 (20 points total)
[Histogram: % of students vs. score range]

CS252 Midterm 1 (80 points total)
Average: 39.8 Median:
[Histogram: % of students vs. score range]

EECS Graduate Grading Guidelines
  A+, A, A-: quality expected from a PhD student
  B+, B: quality expected from an MS student, not PhD
  <= B-: less than the quality expected from an MS student
Class average somewhere in range

Memory Consistency in SMPs
[Figure: CPU-1 with cache-1 and CPU-2 with cache-2 on a CPU-Memory bus with memory; cache-1, cache-2 and memory each hold A = 100]
Suppose CPU-1 updates A to 200.
  write-back: memory and cache-2 have stale values
  write-through: cache-2 has a stale value
Do these stale values matter?
What is the view of shared memory for programming?

Write-back Caches & SC
prog T1: ST X, 1; ST Y, 11
prog T2: LD Y, R1; ST Y', R1; LD X, R2; ST X', R2
(A dash marks a location that holds no value yet.)

Step                          cache-1       memory                      cache-2
T1 is executed                X=1, Y=11     X=0, Y=10, X'=-, Y'=-       Y=-, Y'=-, X=-, X'=-
cache-1 writes back Y         X=1, Y=11     X=0, Y=11, X'=-, Y'=-       Y=-, Y'=-, X=-, X'=-
T2 executed                   X=1, Y=11     X=0, Y=11, X'=-, Y'=-       Y=11, Y'=11, X=0, X'=0
cache-1 writes back X         X=1, Y=11     X=1, Y=11, X'=-, Y'=-       Y=11, Y'=11, X=0, X'=0
cache-2 writes back X' & Y'   X=1, Y=11     X=1, Y=11, X'=0, Y'=11      Y=11, Y'=11, X=0, X'=0

The final memory state is inconsistent.

Write-through Caches & SC
prog T1: ST X, 1; ST Y, 11
prog T2: LD Y, R1; ST Y', R1; LD X, R2; ST X', R2

Step           cache-1       memory                      cache-2
Initially      X=0, Y=10     X=0, Y=10, X'=-, Y'=-       Y=-, Y'=-, X=0, X'=-
T1 executed    X=1, Y=11     X=1, Y=11, X'=-, Y'=-       Y=-, Y'=-, X=0, X'=-
T2 executed    X=1, Y=11     X=1, Y=11, X'=0, Y'=11      Y=11, Y'=11, X=0, X'=0

Write-through caches don't preserve sequential consistency either
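The write-back scenario can be replayed with a toy model of two private caches that have no coherence protocol at all. This is a sketch for illustration: the class, the method names, and the location names `X2` and `Y2` (standing for the two locations T2 copies X and Y into) are assumptions of this example, not part of the lecture.

```python
class WriteBackCache:
    """A private write-back cache with no coherence protocol."""

    def __init__(self, memory):
        self.memory = memory   # shared backing store (a dict)
        self.lines = {}        # address -> cached value
        self.dirty = set()     # addresses modified but not written back

    def load(self, addr):
        if addr not in self.lines:            # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value              # update the cache only
        self.dirty.add(addr)

    def write_back(self, addr):
        if addr in self.dirty:
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)


def scenario():
    """Replay the step order of the write-back example."""
    memory = {"X": 0, "Y": 10, "X2": None, "Y2": None}
    c1, c2 = WriteBackCache(memory), WriteBackCache(memory)
    c1.store("X", 1)                  # T1: ST X, 1
    c1.store("Y", 11)                 # T1: ST Y, 11
    c1.write_back("Y")                # cache-1 writes back Y first
    c2.store("Y2", c2.load("Y"))      # T2 copies Y (sees 11)
    c2.store("X2", c2.load("X"))      # T2 copies X (sees stale 0)
    c1.write_back("X")
    c2.write_back("X2")
    c2.write_back("Y2")
    return memory
```

The returned memory pairs the new Y with the old X, a combination that no sequentially consistent interleaving of T1 and T2 can produce.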

Maintaining Sequential Consistency
SC is sufficient for correct producer-consumer and mutual exclusion code (e.g., Dekker)
Multiple copies of a location in various caches can cause SC to break down.
Hardware support is required such that
  only one processor at a time has write permission for a location
  no processor can load a stale copy of the location after a write
⇒ cache coherence protocols

Cache Coherence Protocols for SC
write request: the address is invalidated (updated) in all other caches before (after) the write is performed
read request: if a dirty copy is found in some cache, a write-back is performed before the memory is read
We will focus on Invalidation protocols as opposed to Update protocols

Warmup: Parallel I/O
[Figure: Processor and Cache connected by Address (A), Data (D) and R/W lines; Cache, Physical Memory and DMA share the Memory Bus; DMA connects to DISK]
Either Cache or DMA can be the Bus Master and effect transfers
Page transfers occur while the Processor is running
(DMA stands for Direct Memory Access)

Problems with Parallel I/O
[Figure: the cache holds cached portions of a page while DMA transfers the page between Physical Memory and DISK]
Memory → Disk: Physical memory may be stale if the Cache copy is dirty
Disk → Memory: Cache may hold stale data and not see memory writes

Snoopy Cache (Goodman 1983)
Idea: Have the cache watch (or snoop upon) DMA transfers, and then "do the right thing"
Snoopy cache tags are dual-ported
[Figure: the cache's Tags and State plus Data (lines); one A, R/W, D port faces the Processor and is used to drive the Memory Bus when the Cache is Bus Master; a second A, R/W snoopy read port is attached to the Memory Bus]

Snoopy Cache Actions for DMA

Observed Bus Cycle          Cache State            Cache Action
DMA Read (Memory → Disk)    Address not cached     No action
                            Cached, unmodified     No action
                            Cached, modified       Cache intervenes
DMA Write (Disk → Memory)   Address not cached     No action
                            Cached, unmodified     Cache purges its copy
                            Cached, modified       ???

Shared Memory Multiprocessor
[Figure: processors M1, M2 and M3, each with a Snoopy Cache, share the Memory Bus with Physical Memory and DMA to DISKS]
Use the snoopy mechanism to keep all processors' view of memory coherent

Cache State Transition Diagram: The MSI protocol
Each cache line has state bits alongside its address tag
  M: Modified
  S: Shared
  I: Invalid

Cache state in processor P1:
  I → S: read miss
  I → M: write miss
  S → S: read by any processor
  S → M: P1 intent to write
  S → I: other processor intent to write
  M → M: P1 reads or writes
  M → S: other processor reads, P1 writes back
  M → I: other processor intent to write

Two Processor Example (Reading and writing the same cache line)
Access sequence:
  P1 reads
  P1 writes
  P2 reads
  P2 writes
  P1 reads
  P1 writes
  P2 writes
  P1 writes
[Figure: the MSI state diagram drawn twice, once for P1's cache and once for P2's cache, with each processor's transitions triggered by the other's reads and intents to write]

Observation
[Figure: the MSI state diagram]
If a line is in the M state then no other cache can have a copy of the line!
Memory stays coherent, multiple differing copies cannot exist
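The two-processor example can be traced with a few lines of code. The following is a minimal sketch of the MSI transitions for a single cache line, tracking only the state letters. Data values, write-backs, and bus arbitration are left out, and the class and method names are hypothetical.

```python
class MSISystem:
    """Tracks the MSI state of one cache line in each of n caches."""

    def __init__(self, n):
        self.state = ["I"] * n

    def read(self, p):
        if self.state[p] == "I":                 # read miss goes on the bus
            for q in range(len(self.state)):
                if q != p and self.state[q] == "M":
                    self.state[q] = "S"          # owner writes back, keeps a copy
            self.state[p] = "S"
        # A read in S or M is a hit and changes nothing.

    def write(self, p):
        if self.state[p] != "M":                 # intent to write goes on the bus
            for q in range(len(self.state)):
                if q != p:
                    self.state[q] = "I"          # all other copies are invalidated
            self.state[p] = "M"

    def invariant_holds(self):
        """If some cache is in M, every other cache must be in I."""
        if "M" in self.state:
            return self.state.count("M") == 1 and self.state.count("S") == 0
        return True


def trace(accesses):
    """Apply (processor, 'r' or 'w') accesses; return the states after each."""
    system, history = MSISystem(2), []
    for proc, kind in accesses:
        (system.read if kind == "r" else system.write)(proc)
        assert system.invariant_holds()
        history.append(tuple(system.state))
    return history
```

Feeding in the access sequence from the example (with P1 as index 0 and P2 as index 1) shows the line bouncing between the two caches, and the M-state observation holds after every step.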

MESI: An Enhanced MSI protocol (increased performance for private data)
Each cache line has state bits alongside its address tag
  M: Modified Exclusive
  E: Exclusive, unmodified
  S: Shared
  I: Invalid

Cache state in processor P1:
  I → S: read miss, shared
  I → E: read miss, not shared
  I → M: write miss
  E → E: P1 read
  E → M: P1 write
  E → S: other processor reads
  E → I: other processor intent to write
  S → S: read by any processor
  S → M: P1 intent to write
  S → I: other processor intent to write
  M → M: P1 write or read
  M → S: other processor reads, P1 writes back
  M → I: other processor intent to write

Optimized Snoop with Level-2 Caches
[Figure: four CPUs, each with its own L1 $, L2 $ and Snooper]
Processors often have two-level caches: small L1, large L2 (usually both on chip now)
Inclusion property: entries in L1 must be in L2
  invalidation in L2 ⇒ invalidation in L1
Snooping on L2 does not affect CPU-L1 bandwidth
What problem could occur?
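The benefit MESI brings for private data can be made concrete by counting bus transactions. The sketch below models one cache line under MESI with a simple counter. It is an illustration under simplifying assumptions (state letters only, one bus transaction per miss or upgrade), and the class and attribute names are invented for this example.

```python
class MESISystem:
    """Tracks the MESI state of one cache line in each of n caches."""

    def __init__(self, n):
        self.state = ["I"] * n
        self.bus_transactions = 0

    def read(self, p):
        if self.state[p] != "I":
            return                                # hit: no bus traffic
        self.bus_transactions += 1                # read miss
        sharers = [q for q in range(len(self.state))
                   if q != p and self.state[q] != "I"]
        for q in sharers:
            self.state[q] = "S"                   # M writes back; M and E drop to S
        self.state[p] = "S" if sharers else "E"   # not shared: Exclusive

    def write(self, p):
        if self.state[p] in ("I", "S"):
            self.bus_transactions += 1            # write miss or intent to write
            for q in range(len(self.state)):
                if q != p:
                    self.state[q] = "I"
        # From E the upgrade to M is silent: no other cache holds the line.
        self.state[p] = "M"
```

A processor that reads and then writes a line nobody else holds pays for one bus transaction here. Under MSI the same pattern would need a second one for the intent to write.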

Intervention
[Figure: cache-1 holds A = 200, memory holds A = 100 (stale data), cache-2 has no copy; all sit on the CPU-Memory bus]
When a read-miss for A occurs in cache-2, a read request for A is placed on the bus
  Cache-1 needs to supply the data and change its state to shared
  The memory may respond to the request also!
Does memory know it has stale data?
Cache-1 needs to intervene through the memory controller to supply correct data to cache-2

False Sharing
[Figure: a cache block laid out as state, blk addr, data0, data1, ..., dataN]
A cache block contains more than one word
Cache coherence is done at the block level and not the word level
Suppose M1 writes word i and M2 writes word k, and both words have the same block address.
What can happen?
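One way to approach the false-sharing question is to count how often a write forces another cache to give up its copy. The sketch below is a simplified model, assuming an invalidation protocol in which each block has at most one writer at a time. The function name and the mapping of word addresses to blocks by integer division are assumptions of this example.

```python
def count_invalidations(writes, block_size):
    """Count writes that invalidate a block held by another processor.

    writes: sequence of (processor, word_address) pairs in bus order.
    block_size: words per cache block.
    """
    owner = {}            # block address -> processor holding it modified
    invalidations = 0
    for proc, word in writes:
        block = word // block_size
        if block in owner and owner[block] != proc:
            invalidations += 1        # the other cache loses its copy
        owner[block] = proc
    return invalidations


# M1 repeatedly writes word 0 while M2 repeatedly writes word 1.
alternating = [(1, 0), (2, 1)] * 4
```

With four-word blocks, both words share a block and every write after the first invalidates the other processor's copy, even though the two processors never touch the same word. With one-word blocks, no invalidations occur.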

Synchronization and Caches: Performance Issues
Processors 1, 2 and 3 each run:
  R ← 1
  L: swap (mutex), R;
  if <R> then goto L;
  <critical section>
  M[mutex] ← 0;
[Figure: three caches on the CPU-Memory Bus; one cache holds mutex=1]
Cache-coherence protocols will cause mutex to ping-pong between P1's and P2's caches.
Ping-ponging can be reduced by first reading the mutex location (non-atomically) and executing a swap only if it is found to be zero.

Performance Related to Bus Occupancy
In general, a read-modify-write instruction requires two memory (bus) operations without intervening memory operations by other processors
In a multiprocessor setting, the bus needs to be locked for the entire duration of the atomic read and write operation
  ⇒ expensive for simple buses
  ⇒ very expensive for split-transaction buses
Modern ISAs use load-reserve and store-conditional

Load-reserve & Store-conditional
Special register(s) to hold the reservation flag and address, and the outcome of store-conditional

Load-reserve R, (a):
  <flag, adr> ← <1, a>;
  R ← M[a];

Store-conditional (a), R:
  if <flag, adr> == <1, a>
  then cancel other procs' reservation on a;
       M[a] ← <R>;
       status ← succeed;
  else status ← fail;

If the snooper sees a store transaction to the address in the reserve register, the reserve bit is set to 0
  Several processors may reserve a simultaneously
  These instructions are like ordinary loads and stores with respect to the bus traffic
Can implement reservation by using cache hit/miss, no additional hardware required (problems?)

Performance: Load-reserve & Store-conditional
The total number of memory (bus) transactions is not necessarily reduced, but splitting an atomic instruction into load-reserve and store-conditional:
  increases bus utilization (and reduces processor stall time), especially in split-transaction buses
  reduces the cache ping-pong effect, because processors trying to acquire a semaphore do not have to perform a store each time
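The load-reserve/store-conditional semantics can be mimicked in software to see how a retry loop behaves when two processors race. This is a minimal single-threaded sketch, not a hardware description: reservations are kept in a dictionary keyed by processor id, a successful store-conditional clears every reservation on that address (including the storing processor's own), and the class and method names are hypothetical.

```python
class LRSCMemory:
    """Toy memory with per-processor reservations."""

    def __init__(self):
        self.mem = {}
        self.reservation = {}      # processor id -> reserved address

    def load_reserve(self, proc, addr):
        self.reservation[proc] = addr          # <flag, adr> <- <1, a>
        return self.mem.get(addr, 0)           # R <- M[a]

    def store_conditional(self, proc, addr, value):
        if self.reservation.get(proc) != addr:
            return False                       # status <- fail
        # Success: cancel all reservations on this address.
        for p in [p for p, a in self.reservation.items() if a == addr]:
            del self.reservation[p]
        self.mem[addr] = value                 # M[a] <- <R>
        return True                            # status <- succeed


def demo():
    """Two processors race to increment the same location."""
    m = LRSCMemory()
    r0 = m.load_reserve(0, "a")
    r1 = m.load_reserve(1, "a")                 # both hold a reservation
    ok0 = m.store_conditional(0, "a", r0 + 1)   # first store wins
    ok1 = m.store_conditional(1, "a", r1 + 1)   # reservation was cancelled
    r1 = m.load_reserve(1, "a")                 # retry from the load
    ok1_retry = m.store_conditional(1, "a", r1 + 1)
    return ok0, ok1, ok1_retry, m.mem["a"]
```

The losing processor's store-conditional fails without writing anything, so it simply retries from the load, and neither increment is lost.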


Out-of-Order Loads/Stores & CC
[Figure: CPU/Memory Interface. The CPU's load/store buffers feed a Cache whose lines are in state I, S or E; the cache sends S-req and E-req to Memory and receives S-rep and E-rep; a snooper handles Wb-req, Inv-req and Inv-rep; pushouts leave the cache as Wb-rep]
Blocking caches: one request at a time + CC ⇒ SC
Non-blocking caches: multiple requests (different addresses) concurrently + CC ⇒ relaxed memory models
CC ensures that all processors observe the same order of loads and stores to an address
