Lecture 9 Outline. Lower-Level Protocol Choices. MESI (4-state) Invalidation Protocol. MESI: Processor-Initiated Transactions


Cuthbert Dorsey
6 years ago

Outline
- MESI protocol
- Dragon update-based protocol
- Impact of protocol optimizations

Lower-Level Protocol Choices
BusRd observed in M state: what transition to make?
- Change to S: assume I'll read again soon. Good for mostly-read data.
- What about migratory data? I read and write, then you read and write, then X reads and writes... Thus:
- Change to I: assume another processor will write to it (Synapse).
- Sequent Symmetry and MIT Alewife use adaptive protocols.

Gehringer, based on slides by Yan Solihin

MESI (4-state) Invalidation Protocol
Problem with the MSI protocol:
- A Rd, Wr sequence incurs 2 transactions even when no one is sharing (e.g., a serial program!): BusRd (I -> S) followed by BusRdX or BusUpgr (S -> M).
- In general, penalizing serial programs is unacceptable.
Add an exclusive state:
- Invalid
- Modified (dirty)
- Shared (two or more caches may have copies)
- Exclusive (only this cache has a clean copy, same value as in memory)
How to decide between E and S? Need to check whether someone else has a copy.
- Shared signal (S) on the bus: a wired-OR line asserted in response to BusRd.

MESI: Processor-Initiated Transactions
[State transition diagram. Edge labels: PrRd/-, PrWr/-, PrWr/BusRdX, PrRd/BusRd(S), PrRd/BusRd(~S).]

MESI: Bus-Initiated Transactions
[State transition diagram. Edge labels: BusRd/-, BusRdX/-, BusRd/Flush, BusRdX/Flush, BusRdX/Flush1.]

MESI State Transition Diagram
[Combined diagram of the processor-initiated and bus-initiated edges.]
BusRd(S) means the shared line is asserted on the BusRd transaction.
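The per-line MESI transitions described above can be sketched as two small lookup functions. This is a minimal illustration, not the lecture's own code: the function names, the `use_upgrade` flag, and the choice to report Flush' for snoops that hit in E or S are assumptions made for the example (some presentations draw the E-state edges with a plain Flush).

```python
def on_processor(state, op, shared=False, use_upgrade=True):
    """Processor-initiated transition for one cache line.

    state: "M", "E", "S" or "I"; op: "PrRd" or "PrWr".
    shared: value of the wired-OR shared line sampled during a BusRd.
    Returns (next_state, bus_transaction or None).
    """
    if op == "PrRd":
        if state == "I":
            # Read miss: the shared line decides between S and E.
            return ("S" if shared else "E"), "BusRd"
        return state, None                      # read hit in M, E or S
    if op == "PrWr":
        if state == "I":
            return "M", "BusRdX"                # write miss
        if state == "S":
            # Data is already present; only the other copies must go.
            return "M", ("BusUpgr" if use_upgrade else "BusRdX")
        return "M", None                        # E -> M is silent, M stays M
    raise ValueError(f"unknown processor op: {op}")


def on_snoop(state, bus_op):
    """Bus-initiated transition. Returns (next_state, action or None)."""
    if state == "I":
        return "I", None
    # A dirty copy must be flushed; a clean copy may be supplied (Flush').
    flush = "Flush" if state == "M" else "Flush'"
    if bus_op == "BusRd":
        return "S", flush
    if bus_op == "BusRdX":
        return "I", flush
    if bus_op == "BusUpgr":
        return "I", None                        # only expected while in S
    raise ValueError(f"unknown bus op: {bus_op}")


def serial_program_transactions():
    """Rd followed by Wr with no other sharer, as in a serial program."""
    state, bus = "I", []
    for op in ("PrRd", "PrWr"):
        state, txn = on_processor(state, op, shared=False)
        if txn:
            bus.append(txn)
    return state, bus
```

Calling `serial_program_transactions()` ends in state M with a single BusRd on the bus, which is the saving the Exclusive state is meant to provide over MSI.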

Flush vs. Flush1 (Flush' in textbook)
- Flush: mandatory
- Flush' (Flush1): happens only when cache-to-cache sharing is used, and only one cache flushes the data

MESI Visualization
[Animation frames showing the caches, the bus, and main memory as a block X is read and written; the per-frame values did not survive extraction.]
- One less bus request due to the Exclusive state, especially for serial programs.

MESI Visualization (continued)
[Animation frames showing the state and value of X in each cache and in memory; the per-frame values did not survive extraction.]
- Note: BusUpgr instead of BusRdX.
- A Flush between caches is referred to as cache-to-cache transfer in the Illinois protocol.

MESI Example (Cache-to-Cache Transfer)
[Table with columns: Proc Action, State (one column per cache), Bus Action, Data From. The rows did not survive extraction; recoverable entries include a write W3 that issues BusRdX, a Flush1 response, and data supplied from Mem or from a cache.]
* Data from memory if no cache-to-cache transfer, BusRd/-
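A table like the one above can be regenerated with a small single-block simulator. The sketch below is an illustration under stated assumptions: the access trace (R1, W1, R3, W3, R1, R3, R2), the three-cache setup, and the names `mesi_step`, `run_trace`, `use_upgrade` and `cache_to_cache` are all hypothetical choices for the example, and replacements are not modeled.

```python
def mesi_step(states, proc, op, use_upgrade=True, cache_to_cache=True):
    """Apply one access to a single block; returns (bus_action, data_from).

    states: list of per-cache states ("M", "E", "S", "I"), updated in place.
    "-" means no bus transaction or no data transfer.
    """
    me = states[proc]
    sharers = [i for i in range(len(states)) if i != proc and states[i] != "I"]
    dirty = any(states[i] == "M" for i in sharers)

    if op == "R":
        if me != "I":
            return "-", "-"                      # read hit
        bus = "BusRd"
        for i in sharers:
            states[i] = "S"                      # M, E and S all end in S
        states[proc] = "S" if sharers else "E"   # shared line picks S or E
    else:
        if me in ("M", "E"):
            states[proc] = "M"                   # silent upgrade from E
            return "-", "-"
        bus = "BusUpgr" if (me == "S" and use_upgrade) else "BusRdX"
        for i in sharers:
            states[i] = "I"
        states[proc] = "M"
        if bus == "BusUpgr":
            return bus, "-"                      # requester already has data

    # A dirty copy must be flushed; clean copies supply data only if
    # cache-to-cache transfer (Flush') is enabled.
    if dirty or (sharers and cache_to_cache):
        return bus, "Cache"
    return bus, "Mem"


def run_trace(trace, n_caches=3, **options):
    """trace: strings such as "R1" or "W3" (operation + 1-based processor)."""
    states = ["I"] * n_caches
    log = []
    for access in trace:
        op, proc = access[0], int(access[1:]) - 1
        log.append(mesi_step(states, proc, op, **options))
    return log, states


TRACE = ["R1", "W1", "R3", "W3", "R1", "R3", "R2"]

if __name__ == "__main__":
    log, final = run_trace(TRACE)
    for access, (bus, src) in zip(TRACE, log):
        print(f"{access}: bus={bus:8s} data from={src}")
    print("final states:", final)
```

Passing `use_upgrade=False` makes the write from S issue BusRdX instead of BusUpgr, and `cache_to_cache=False` makes clean data come from memory, which corresponds to the footnote on the example table.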

MESI Example (Cache-to-Cache Transfer + BusUpgr)
[Table with columns: Proc Action, State (one column per cache), Bus Action, Data From. The rows did not survive extraction; recoverable entries include a write W3 that issues BusUpgr and data supplied from Mem or from a cache.]
* Data from memory if no cache-to-cache transfer, BusRd/-

Lower-Level Protocol Choices
Who supplies data on a miss when the block is not in M state: memory or cache?
- Original, Illinois MESI: cache. Assumes the cache is faster than memory (cache-to-cache transfer). Not necessarily true.
- Adds complexity:
  - How does memory know it should supply data? (It must wait for the caches.)
  - A selection algorithm is needed if multiple caches have valid data.
- Valuable for distributed memory:
  - May be cheaper to obtain data from a nearby cache than from distant memory.
  - Especially when constructed out of SMP nodes (Stanford DASH).

Dragon Write-back Update Protocol
Four states:
- Exclusive-clean (E): I and memory have it
- Shared clean (Sc): I, others, and maybe memory have it, but I'm not the owner
- Shared modified (Sm): I and others have it but not memory, and I'm the owner
  - Sm and Sc can coexist in different caches, with at most one Sm
- Modified or dirty (M): I have it, and no one else
On replacement: Sc can silently drop, Sm has to flush.
No invalid state:
- If in cache, the block cannot be invalid.
- If not present in cache, the block can be viewed as being in a not-present or invalid state.
New processor events: PrRdMiss, PrWrMiss
- Introduced to specify actions when the block is not present in the cache.
New bus transaction: BusUpd
- Broadcasts the single word written on the bus; updates other relevant caches.

Dragon State Transition Diagram
[State transition diagram over E, Sc, Sm, M. Edge labels: PrRd/-, PrWr/-, PrRdMiss/BusRd(S), PrRdMiss/BusRd(~S), PrWr/BusUpd(S), PrWr/BusUpd(~S), PrWrMiss/(BusRd(S); BusUpd), PrWrMiss/BusRd(~S), BusRd/-, BusRd/Flush, BusUpd/Update.]

Dragon: Processor-Initiated Transactions
[State transition diagram. Edge labels: PrRd/-, PrWr/-, PrRdMiss/BusRd(S), PrRdMiss/BusRd(~S), PrWr/BusUpd(S), PrWr/BusUpd(~S), PrWrMiss/(BusRd(S); BusUpd), PrWrMiss/BusRd(~S).]
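The Dragon transitions can be sketched for a single block shared by a few caches. This is a rough model under assumptions: the function name `dragon_step`, the use of `None` for "not present", and the sample trace (R1, W1, R3, W3, R1, R3, R2) are hypothetical, replacements are not modeled, and memory contents are not tracked.

```python
def dragon_step(states, proc, op):
    """Apply one access ("R" or "W") by cache `proc`; returns bus actions.

    states: per-cache state, one of "E", "Sc", "Sm", "M", or None when the
    block is not present. The list is updated in place.
    """
    others = [i for i in range(len(states))
              if i != proc and states[i] is not None]
    shared = bool(others)              # wired-OR shared line
    actions = []

    if states[proc] is None:           # PrRdMiss or PrWrMiss: fetch the block
        actions.append("BusRd")
        for i in others:
            if states[i] == "E":
                states[i] = "Sc"       # no longer the only copy
            elif states[i] == "M":
                states[i] = "Sm"       # supplies data (Flush), stays owner
        states[proc] = "Sc" if shared else "E"

    if op == "W":
        if states[proc] in ("E", "M"):
            states[proc] = "M"         # sole copy: no bus traffic needed
        else:                          # Sc or Sm: broadcast the written word
            actions.append("BusUpd")
            for i in others:
                states[i] = "Sc"       # copies updated; ownership moves
            states[proc] = "Sm" if shared else "M"
    return actions


def run_dragon(trace, n_caches=3):
    """trace: strings such as "R1" or "W3" (operation + 1-based processor)."""
    states = [None] * n_caches
    log = []
    for access in trace:
        op, proc = access[0], int(access[1:]) - 1
        log.append(dragon_step(states, proc, op))
    return log, states


if __name__ == "__main__":
    trace = ["R1", "W1", "R3", "W3", "R1", "R3", "R2"]
    log, final = run_dragon(trace)
    for access, actions in zip(trace, log):
        print(access, actions or "-")
    print("final states:", final)
```

In this trace the second R1 produces no bus traffic because the earlier BusUpd kept that cache's copy current, whereas under an invalidation protocol the same read would miss.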

Dragon: Bus-Initiated Transactions
[State transition diagram. Edge labels: BusRd/-, BusRd/Flush, BusUpd/Update.]

Dragon Visualization
[Animation frames showing the caches, the bus, and main memory as a block X is read and written; the per-frame values did not survive extraction.]
- One less bus request due to the Exclusive state, especially for serial programs.

Dragon Visualization (continued)
[Animation frames showing the state (Sc, Sm) and value of X in each cache; the per-frame values did not survive extraction.]
- Note: BusUpd instead of BusUpgr (no invalidation is performed).
- A later read by a cache that was updated is a hit; this is a miss in the MSI and MESI protocols.
- Note: only the cache with Sm is responsible for cache-to-cache transfer.

Dragon Example
[Table with columns: Proc Action, State (one column per cache), Bus Action, Data From. The rows did not survive extraction; recoverable entries include writes W1 and W3, a BusUpd/Upd transaction, and states Sc and Sm.]
- On replacement of X, the owner is responsible for writing back to memory, vs. MSI or MESI, where a writeback occurs only when the line is in the M state.

Lower-Level Protocol Choices
Can the shared-modified state be eliminated?
- Yes, if memory is updated as well on BusUpd transactions (DEC Firefly).
- The Dragon protocol doesn't do this (it assumes DRAM memory is slow to update).
Should replacement of an Sc block be broadcast?
- Would allow the last copy to go to the Exclusive state and not generate updates.
- The replacement bus transaction is not in the critical path; a later update may be.
Shouldn't update the local copy on a write hit before the controller gets the bus.
- Can mess up serialization.
Coherence and consistency considerations are much like the write-through case.
In general, there are many subtle race conditions in protocols.
But first, let's illustrate quantitative assessment at the logical level.

Assessing Protocol Tradeoffs
Methodology:
- Use a simulator; choose parameters per the earlier methodology (default 1 MB, 4-way cache, 64-byte block, 16 processors; 64K cache for some).
- Focus on frequencies, not end performance for now: this transcends architectural details, but is not what we're really after.
- Use an idealized memory performance model to avoid changes of reference interleaving across processors with machine parameters.
- Cheap simulation: no need to model contention.

Impact of Protocol Optimizations
MESI vs. MSI (w/ BusUpgr) vs. MSI (w/ BusRdX)
[Two bar charts of traffic (MB/s). Bars labeled Ill, 3St, and 3St-RdEx for Barnes, LU, Ocean, Radiosity, Radix, Raytrace, and for Appl-Code, Appl-Data, OS-Code, OS-Data.]
- Upgrades instead of read-exclusive help.
- Same story when working sets don't fit, for Ocean, Radix, Raytrace.

Impact of Cache-Block Size
Multiprocessors add a new kind of miss to cold, capacity, conflict:
- Coherence misses: due to invalidations
  - True sharing: write to the same word
  - False sharing: write to different words
Reducing misses architecturally in an invalidation protocol:
- Capacity: enlarge the cache; increase block size (if there is spatial locality)
- Conflict: increase associativity
- Cold and coherence: only block size
Increasing block size has advantages and disadvantages:
- Can reduce misses if spatial locality is good
- Can hurt too:
  - increases misses due to false sharing if spatial locality is not good
  - increases misses due to conflicts in a fixed-size cache
  - increases traffic due to fetching unnecessary data and due to false sharing
  - can increase miss penalty and perhaps hit cost

Impact of Block Size on Miss Rate
For the default problem size: vary block/line size from 8 to 256 bytes.
[Two bar charts of miss rate (%), broken down into Cold, Capacity, True sharing, False sharing, and Upgrade, at block sizes 8, 16, 32, 64, 128, and 256 bytes: one for Barnes, LU, Radiosity; one for Ocean, Radix, Raytrace.]
- Decreases with larger lines: cold, capacity (due to spatial locality), true sharing (due to spatial locality)
- Increases with larger lines: false sharing
- Working set doesn't fit: impact of capacity misses is large (Ocean, Radix)

Impact of Block Size on Traffic
Traffic affects performance indirectly through contention.
[Three charts of traffic versus block size (8 to 256 bytes), in bytes/instruction or bytes/FLOP, covering Radix, LU, Ocean, Barnes, Radiosity, and Raytrace.]
- Results are different than for miss rate: traffic almost always increases.
- When working sets fit, overall traffic is still small, except for Radix.
- Fixed overhead is a significant component, so total traffic is often minimized at a 16-32 byte block, not smaller.
- Working set doesn't fit: even 128-byte blocks are good for Ocean due to capacity.
- Address bus traffic behaves in the opposite way from data bus traffic.
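The true-versus-false sharing distinction, and why larger blocks produce more false sharing, can be illustrated with a small classifier. This is a simplified sketch under assumptions: the function name, the 4-byte word size, and the sample addresses are hypothetical, and the rule used (same block and same word means true sharing, same block but different word means false sharing) is the slide-level definition rather than a full miss-classification algorithm.

```python
def classify_sharing(write_addr, read_addr, block_size, word_size=4):
    """Classify the effect of one processor's write on another's later read.

    Addresses are byte addresses; block_size and word_size are in bytes.
    Returns "true sharing", "false sharing", or "unaffected".
    """
    if write_addr // block_size != read_addr // block_size:
        return "unaffected"            # different blocks: no invalidation
    if write_addr // word_size == read_addr // word_size:
        return "true sharing"          # the reader needs the written word
    return "false sharing"             # same block, unrelated word


if __name__ == "__main__":
    # One processor writes the word at byte 0; another reads the word at byte 40.
    for block_size in (8, 16, 32, 64, 128, 256):
        print(block_size, classify_sharing(0, 40, block_size))
```

With these sample addresses the two accesses stay independent up to 32-byte blocks and become false sharing from 64 bytes upward, which is the trend the miss-rate breakdown describes.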
