The MESI State Transition Graph


Small-scale shared memory multiprocessors

- Semantics of the shared address space model (Ch )
- Design of the M(O)ESI snoopy protocol
- Design of the Dragon snoopy protocol
- Performance issues
- Synchronization primitives and algorithms

2/10/2009

The MESI State Transition Graph

- Also called the Illinois protocol (Papamarcos and Patel 1984)
- The E-state saves a bus transaction when there is no sharing (a quite common case)
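To make the role of the E-state concrete, the following is a minimal sketch of the commonly taught MESI transitions written as two C functions. The bus transaction names (BUS_RD, BUS_RDX, BUS_UPGR), the function names, and the shared_line parameter are assumptions of this sketch, not labels taken from the slide's diagram.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum mesi   { INVALID, SHARED, EXCLUSIVE, MODIFIED };
enum bus_op { BUS_NONE, BUS_RD, BUS_RDX, BUS_UPGR };

/*
 * Processor-side transition. is_write selects a processor write versus a
 * processor read. shared_line is the signal sampled during a bus read
 * (true if some other cache holds the block). The bus transaction the
 * cache must issue is returned through *op.
 */
enum mesi mesi_processor(enum mesi s, bool is_write, bool shared_line,
                         enum bus_op *op)
{
    assert(op != NULL);
    *op = BUS_NONE;
    switch (s) {
    case INVALID:
        if (is_write) { *op = BUS_RDX; return MODIFIED; }
        *op = BUS_RD;
        return shared_line ? SHARED : EXCLUSIVE;
    case SHARED:
        if (is_write) { *op = BUS_UPGR; return MODIFIED; }
        return SHARED;
    case EXCLUSIVE:
        /* the E-state pays off here: the write needs no bus transaction */
        return is_write ? MODIFIED : EXCLUSIVE;
    case MODIFIED:
        return MODIFIED;
    }
    return s;
}

/* Snoop-side transition for a bus transaction issued by another cache. */
enum mesi mesi_snoop(enum mesi s, enum bus_op seen)
{
    switch (seen) {
    case BUS_RD:
        /* M and E lose exclusivity (M also supplies the dirty block) */
        return (s == INVALID) ? INVALID : SHARED;
    case BUS_RDX:
    case BUS_UPGR:
        return INVALID;
    case BUS_NONE:
        break;
    }
    return s;
}
```

A read miss with no other sharer lands in EXCLUSIVE, and the following write moves to MODIFIED with BUS_NONE, which is the saved bus transaction.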

MESI Protocol (cont'd)

- Requires a shared bus line (wired OR)
- One sharer could supply the block instead of memory
  - Requires an extra state: one of the sharers owns the block
  - MOESI protocol supports this
- CC and SC are satisfied in the same way as MSI

A Write-Update Protocol

- State transition diagram for the Dragon protocol

Tradeoffs between Write-Update and Write-Invalidate

- A write-run: a sequence of memory operations from a single processor to a block, delimited by an access from another processor
- Example: W2, R1, W1, W1, R1, W1, R3
- Length(write-run) = L = number of writes in the run (L = 3 in the example)
- Write-invalidate: a write-run of any length results in one coherence miss
- Write-update: a write-run of length L results in L updates

Write-run Statistics

- Write-runs are typically evenly distributed across lengths of 1-9
- For PTHOR they are quite short; for the others there is no clear peak
- Write-update results in high traffic peaks for these applications
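The write-run definition can be sketched as a small trace scanner. This is one reading of the definition, with assumptions stated plainly: the struct and function names are hypothetical, a run is taken to be a maximal stretch of accesses by one processor, runs with no writes are ignored, and a run still open at the end of the trace is not counted. Under that reading the leading W2 in the example forms its own run of length 1, ahead of the L = 3 run by processor 1.

```c
#include <assert.h>
#include <stddef.h>

/* One access in a per-block trace: which processor, and whether it wrote. */
struct access {
    int proc;      /* processor id */
    int is_write;  /* 1 = write, 0 = read */
};

/*
 * Scan a trace of accesses to a single block and store the length
 * (number of writes) of every write-run that is terminated by an access
 * from a different processor. Returns the number of lengths stored;
 * runs beyond max_runs are dropped.
 */
size_t write_runs(const struct access *trace, size_t n,
                  int *lengths, size_t max_runs)
{
    assert(trace != NULL || n == 0);
    size_t runs = 0;
    int writes = 0;            /* writes seen in the current run */

    for (size_t i = 0; i < n; i++) {
        if (i > 0 && trace[i].proc != trace[i - 1].proc) {
            /* another processor touched the block: the run ends here */
            if (writes > 0 && runs < max_runs)
                lengths[runs++] = writes;
            writes = 0;
        }
        if (trace[i].is_write)
            writes++;
    }
    return runs;   /* a trailing, unterminated run is not counted */
}

/* Write-update traffic: one update per write in each run (sum of L). */
int update_count(const int *lengths, size_t runs)
{
    int total = 0;
    for (size_t i = 0; i < runs; i++)
        total += lengths[i];
    return total;
}
```

For the example trace this yields two runs with lengths 1 and 3: write-invalidate pays one coherence miss per run, while write-update pays 1 + 3 = 4 updates.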

Performance Issues

- Important performance issues: average memory access time is affected by
  - number of misses
  - communication time associated with each miss
- Impact of design decisions such as
  - various protocol optimizations
  - block size
- Let's focus on the impact of block size on miss rate and traffic

4C Model of Cache Misses

- Recall the 4C-model of misses for write-invalidate protocols
  - Compulsory (or cold) misses
  - Capacity misses
  - Conflict misses
  - Coherence (or communication) misses

Impact of Block Size

Increasing the block size can
- reduce miss rate if spatial locality is good
- increase miss rate due to false sharing if spatial locality is not good
- increase miss rate due to capacity or address mapping constraints
- increase traffic due to fetching of unnecessary data (mismatch fetch/access size, false sharing)
- increase miss penalty and perhaps hit cost

Classification of Misses
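False sharing is easiest to see in a data layout. The sketch below assumes a 64-byte block (BLOCK_SIZE is an assumption, as are the struct and function names) and contrasts two per-processor counters that share a block with a padded layout that gives each counter its own block.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define BLOCK_SIZE 64   /* assumed cache block size in bytes */

/* Both counters fall in one block: writes by different processors to a
 * and b invalidate each other's copy even though no data is shared. */
struct packed_counters {
    long a;   /* written only by processor 1 */
    long b;   /* written only by processor 2 */
};

/* Each counter starts on its own block boundary, so the two writers
 * never contend for the same block. */
struct padded_counters {
    alignas(BLOCK_SIZE) long a;
    alignas(BLOCK_SIZE) long b;
};

static_assert(sizeof(struct padded_counters) == 2 * BLOCK_SIZE,
              "one block per counter");

/* True if byte offsets x and y map to the same block. */
int same_block(size_t x, size_t y)
{
    return x / BLOCK_SIZE == y / BLOCK_SIZE;
}
```

The padded layout trades memory footprint for fewer coherence misses, which mirrors the block-size tradeoff listed above: larger blocks make it more likely that unrelated data ends up sharing a block.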

Example 1

- Assume A and B belong to the same memory block
- P1, P2: Read A (cold miss); Read B (cold miss); Write A (invalidates P2's block); Read B (true sharing miss); Read A; Write B (invalidates P2's block)
- Classification is done when the block is evicted

Example 2

- Assume A and B belong to the same memory block
- P1: Read A; Write A; Write B
- P2: (cold miss); Read B (cold miss); Read C (evict B); Read B (replacement miss); Read C (evict B)

Parallel Computer Organization and Design: Lecture 5

Impact on Miss Rate

[Figure: miss rate (%) for Barnes, Lu, Radiosity, Ocean, Radix, and Raytrace at block sizes 8, 16, 32, 64, 128, and 256, broken down into cold, capacity, true sharing, false sharing, and upgrade misses]

- False sharing misses typically increase with increased block size

Impact on Traffic

- Traffic increases significantly with block size

Synchronization Primitives

- Important part of the communication architecture, with a rich set of tradeoffs across many layers
- Components of a synchronization event
  - acquire method
  - waiting algorithm (spin or block)
  - release method
- Our focus
  - mutual exclusion
  - global (barrier) & event synchronization

Lock Primitives

- Support an atomic test&set in the ISA
- Semantics: reads and modifies a location atomically

    lock:   t&s register, location
            bnz lock                /* if not 0, try again */
            ret                     /* return control to caller */
    unlock: st location, #0         /* write 0 to location */
            ret                     /* return control to caller */

- Interacts poorly with a write-invalidate protocol: every t&s is a write, so spinning processors keep invalidating each other's copies of the lock
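The test&set lock maps fairly directly onto C11's atomic_flag. This is a minimal sketch, not the slide's ISA-level code; the type and function names are hypothetical, and the memory orders are one reasonable choice.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* A lock is one flag: clear = free, set = held. */
typedef struct { atomic_flag held; } tas_lock;

/* One test&set attempt; returns 1 if the lock was acquired. */
int tas_try_lock(tas_lock *l)
{
    assert(l != NULL);
    /* atomic_flag_test_and_set returns the previous value of the flag */
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

void tas_lock_acquire(tas_lock *l)
{
    while (!tas_try_lock(l))
        ;   /* every failed attempt is still a write to the lock's block */
}

void tas_lock_release(tas_lock *l)
{
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A lock is initialized with `tas_lock l = { ATOMIC_FLAG_INIT };`. The spin loop in tas_lock_acquire is where the invalidation traffic comes from, since each retry performs an atomic write.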

Test&test&set

    lock:   ld register, location   /* load into register */
            bnz lock                /* if not 0, try again */
            t&s register, location
            bnz lock                /* if not 0, try again */
            ret                     /* return control to caller */
    unlock: st location, #0         /* write 0 to location */
            ret                     /* return control to caller */

- Spinning is done by memory reads: no coherence traffic on busy waiting

Fetch&Op and Comp&Swap

- Semantics:

    Fetch&op (location) {
        temp = *location;
        *location = operation(temp);
        return temp;
    }

- Fetch&increment: add 1 to location
- Fetch&add: requires an additional operand
- Compare&Swap: if the value is the same as the first operand, swap with the value of the second operand
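As a rough C11 counterpart, the sketch below shows a test&test&set lock that spins on plain loads, plus thin wrappers showing how fetch&increment and compare&swap correspond to standard atomic operations. All type and function names are hypothetical, and compare_and_swap here stores the new value and reports success rather than exchanging operands as the slide describes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct { atomic_int held; } ttas_lock;   /* 0 = free, 1 = held */

void ttas_acquire(ttas_lock *l)
{
    assert(l != NULL);
    for (;;) {
        /* test: spin with plain loads, which hit in the local cache */
        while (atomic_load_explicit(&l->held, memory_order_relaxed) != 0)
            ;
        /* test&set: only now attempt the atomic write */
        if (atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0)
            return;
    }
}

void ttas_release(ttas_lock *l)
{
    assert(l != NULL);
    atomic_store_explicit(&l->held, 0, memory_order_release);
}

/* Fetch&increment: returns the old value and adds 1 atomically. */
int fetch_and_increment(atomic_int *location)
{
    return atomic_fetch_add(location, 1);
}

/* Compare&swap: store new_value only if *location == expected.
 * Returns 1 on success, 0 if the location held some other value. */
int compare_and_swap(atomic_int *location, int expected, int new_value)
{
    return atomic_compare_exchange_strong(location, &expected, new_value);
}
```

The inner load loop generates no bus traffic while the lock is held; traffic appears only when the lock is released and the waiters race to perform the exchange.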

Performance Goals for Locks

- Low uncontended latency
- Low traffic
  - e.g., t&s generates invalidations
- Low storage requirements
  - small and independent of number of processors
- Fairness
  - Unfair locking may cause starvation

Load-Locked Store Conditional

    lock:   ll reg1, loc        /* load-locked loc into reg1 */
            bnz reg1, lock      /* if loc was locked, try again */
            sc loc, reg2        /* store reg2 conditionally into loc */
            beqz lock           /* if store conditional failed, start again */
            ret                 /* return control to caller */
    unlock: st loc, #0          /* write 0 to location */
            ret                 /* return control to caller */

- One can synthesize a family of synchronization primitives
- Low uncontended latency
- No coherence traffic if it fails
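C has no direct load-locked/store-conditional, so the sketch below uses atomic_compare_exchange_weak as a stand-in: like sc, it may fail and force a retry, and on several architectures compilers implement it with an ll/sc pair. It is not equivalent to ll/sc (it compares values rather than tracking intervening writes), and the function names are hypothetical. The point is only to illustrate how a family of fetch&op primitives can be synthesized from one retry loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Synthesize fetch&op from a load / conditional-store retry loop. */
int fetch_and_op(atomic_int *location, int (*op)(int))
{
    assert(location != NULL && op != NULL);
    int old = atomic_load(location);              /* plays the role of ll */
    /* plays the role of sc: succeeds only if *location still equals old;
     * on failure, old is refreshed with the current value and we retry */
    while (!atomic_compare_exchange_weak(location, &old, op(old)))
        ;
    return old;
}

/* Example operations (hypothetical helpers for illustration). */
int add_one(int x) { return x + 1; }
int twice(int x)   { return 2 * x; }
```

Passing add_one gives fetch&increment; any other pure function of the old value gives the corresponding fetch&op.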

Advanced Locking Algorithms

- Previous locking schemes make processors compete and are inherently unfair
- Ticket locks
  - Get a ticket number through fetch&incr
  - Get exclusive access when now-serving # equals ticket #
  - Lock is released by increasing now-serving #
  - Can cause much read traffic
- Array-based lock
  - Use as many positions as there are processes
  - Spin wait on the designated position

Point-to-point Event Synchronization

    P1:               P2:
    a = f(x);         while (flag is 0) do nothing;
    flag = 1;         b = g(a);   /* use a */

- flag is used to signal an event
- Full/empty bits: delegate spin-waiting and flag operations to the architecture level

    P1:                         P2:
    a = f(x);   /* set a */     b = g(a);   /* use a */

- Has not been very popular because of cost and inflexibility
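The ticket lock steps can be sketched with two C11 atomic counters. The type, field, and function names are hypothetical, and the memory orders are one reasonable choice rather than the only correct one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    atomic_uint next_ticket;   /* next ticket to hand out */
    atomic_uint now_serving;   /* ticket currently allowed in */
} ticket_lock;

/* Take a ticket with fetch&increment and return it. */
unsigned ticket_take(ticket_lock *l)
{
    assert(l != NULL);
    return atomic_fetch_add_explicit(&l->next_ticket, 1, memory_order_relaxed);
}

/* True once the given ticket is being served. */
int ticket_is_served(ticket_lock *l, unsigned ticket)
{
    assert(l != NULL);
    return atomic_load_explicit(&l->now_serving, memory_order_acquire) == ticket;
}

void ticket_acquire(ticket_lock *l)
{
    unsigned my = ticket_take(l);
    while (!ticket_is_served(l, my))
        ;   /* every waiter polls now_serving: the source of read traffic */
}

void ticket_release(ticket_lock *l)
{
    assert(l != NULL);
    /* only the holder writes now_serving; advancing it admits the next ticket */
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}
```

Tickets are served in the order they were taken, which is what makes the lock fair. An array-based lock keeps the same ordering but has each waiter spin on its own position, so a release disturbs only one waiter.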

Barrier Synchronization

- Can be implemented in software based on locks

    struct bar_type {
        int counter;
        struct lock_type lock;
        int flag = 0;
    } bar_name;

    BARRIER (bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;              /* reset flag if first to reach */
        mycount = ++bar_name.counter;       /* mycount is private; counts this arrival */
        UNLOCK(bar_name.lock);
        if (mycount == p) {                 /* last to arrive */
            bar_name.counter = 0;           /* reset for next barrier */
            bar_name.flag = 1;              /* release waiters */
        } else
            while (bar_name.flag == 0) {};  /* busy wait for release */
    }

- Exercise: There is a subtle problem with this implementation
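One widely used way to make a centralized barrier safe to reuse is sense reversal, sketched below. It is offered as one possible direction for the exercise, not as the intended answer. The names are hypothetical, and the sketch replaces the lock-protected counter with an atomic fetch&add.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    atomic_int counter;   /* arrivals so far in the current episode */
    atomic_int flag;      /* release flag; its meaning alternates */
    int p;                /* number of participating threads */
} sr_barrier;

/*
 * local_sense is private to each thread and must start at 0, matching the
 * initial value of flag. Each episode waits for flag to take the opposite
 * value from the previous one, so the flag is never reset while a slow
 * waiter from the previous episode may still be reading it.
 */
void sr_barrier_wait(sr_barrier *b, int *local_sense)
{
    assert(b != NULL && local_sense != NULL);
    *local_sense = !*local_sense;                 /* sense for this episode */
    if (atomic_fetch_add(&b->counter, 1) + 1 == b->p) {
        atomic_store(&b->counter, 0);             /* reset for next episode */
        atomic_store(&b->flag, *local_sense);     /* release waiters */
    } else {
        while (atomic_load(&b->flag) != *local_sense)
            ;                                     /* busy wait for release */
    }
}
```

Because the counter is reset before the flag flips, no thread can start counting for the next episode until the current one has been fully released.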
