ECE 5315: Parallel programs (PP) used in class for assessing cache coherence protocols
Assessing Protocol Design
The benchmark programs are executed on a multiprocessor simulator. The state transitions observed determine the frequency of various events, such as cache misses and bus transactions. The effect of a protocol is evaluated in terms of design parameters (e.g., bandwidth requirements, cache block size). The analysis is based on the frequency of various events, not on absolute time (since it is a simulation).
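A minimal sketch of the event-counting side of such a simulation. The event names, the trace-driven style, and the per-1000-references reporting are illustrative assumptions, not the actual simulator used in class:

```cpp
#include <cstdio>
#include <map>

// Hypothetical event names; a real simulator emits such events while
// applying the coherence protocol to a memory-reference trace.
enum class Event { ReadMiss, WriteMiss, BusRd, BusRdX, BusUpgr, WriteBack };

struct EventCounter {
    std::map<Event, long> counts;
    void record(Event e) { ++counts[e]; }
    // Report frequencies (per 1000 references), not absolute times.
    void report(long totalRefs) const {
        for (const auto& [e, n] : counts)
            std::printf("event %d: %.3f per 1000 refs\n",
                        static_cast<int>(e), 1000.0 * n / totalRefs);
    }
};

int main() {
    EventCounter c;
    c.record(Event::ReadMiss);   // events would normally come from the
    c.record(Event::BusRd);      // simulated protocol, not hard-coded
    c.report(2);
    return 0;
}
```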
State Transitions
Configuration: 16 processors, 1 MB 4-way set-associative cache, 64 B block, MESI protocol.
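A minimal sketch of the per-block MESI next-state function such a simulation applies; the enum names and the single `sharedLine` flag are simplifying assumptions, and bus actions (BusRdX/BusUpgr, flushes) are only noted in comments:

```cpp
#include <cassert>
#include <cstdio>

enum class State { Invalid, Shared, Exclusive, Modified };
enum class Op { PrRd, PrWr, BusRd, BusRdX };

// Simplified MESI next-state function for one cache block.
// `sharedLine` models the snoop result on a processor read miss:
// true if some other cache holds the block.
State mesiNext(State s, Op op, bool sharedLine) {
    switch (op) {
    case Op::PrRd:
        if (s == State::Invalid)            // read miss: BusRd issued
            return sharedLine ? State::Shared : State::Exclusive;
        return s;                           // read hit: state unchanged
    case Op::PrWr:
        return State::Modified;             // BusRdX/BusUpgr issued if needed
    case Op::BusRd:                         // another processor reads
        if (s == State::Modified || s == State::Exclusive)
            return State::Shared;           // M additionally flushes the block
        return s;
    case Op::BusRdX:                        // another processor writes
        return State::Invalid;              // M additionally flushes the block
    }
    assert(false);
    return s;
}

int main() {
    State s = mesiNext(State::Invalid, Op::PrRd, /*sharedLine=*/false);
    std::printf("I --PrRd--> %s\n", s == State::Exclusive ? "E" : "other");
    return 0;
}
```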
Bandwidth Requirements
[Figure: address-bus and data-bus traffic (MB/s), assuming 200 MIPS/MFLOPS processors and 1 MB caches. Left panel: parallel program workload (Barnes, LU, Ocean, Radiosity, Radix, Raytrace), traffic up to about 200 MB/s. Right panel: multiprogram workload (Appl-Code, Appl-Data, OS-Code, OS-Data), traffic up to about 80 MB/s. Each workload is shown under three protocols: III = MESI, 3St = MSI with BusUpgr, 3St-RdEx = MSI with BusRdX.]
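A rough sketch of how traffic figures like these are derived: simulated event frequencies (per instruction) are scaled by the bytes each bus transaction carries and by the assumed 200 MIPS processor rate. The per-transaction byte counts and the event rate below are made-up illustrative values, not numbers from the figure:

```cpp
#include <cstdio>

int main() {
    const double mips = 200e6;           // assumed instructions per second
    const double busTxnsPerInstr = 0.01; // from simulated event counts (example value)
    const double addrBytesPerTxn = 6.0;  // address + command bytes (assumed)
    const double dataBytesPerTxn = 64.0; // one 64 B cache block

    // Traffic = events/instruction * bytes/event * instructions/second.
    double addrMBps = busTxnsPerInstr * addrBytesPerTxn * mips / 1e6;
    double dataMBps = busTxnsPerInstr * dataBytesPerTxn * mips / 1e6;
    std::printf("address bus: %.1f MB/s, data bus: %.1f MB/s\n",
                addrMBps, dataMBps);
    return 0;
}
```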
Cache-Miss Types
Cold miss --- occurs on the first reference to a memory block by a processor (compulsory miss).
Capacity miss --- occurs when all the blocks referenced during the execution of a program do not fit in the cache.
Collision miss --- occurs in caches with less than full associativity, i.e., the referenced block does not fit in its set (conflict miss).
Coherence miss --- occurs when blocks of data are shared among multiple processors.
True sharing: a word in a cache block produced by one processor is used by another processor.
False sharing: words accessed by different processors happen to be placed in the same block.
Sharing Misses: Illustration
True sharing miss: one processor writes some words in a cache block; the copies of that block in other processors' caches are invalidated; a second processor then reads one of the modified words (read miss).
False sharing miss: one processor writes some words in a cache block; the copies of that block in other processors' caches are invalidated; a second processor then reads a different word in the same cache block.
Sharing Misses
True sharing misses: reduced by increasing the cache block size and the spatial locality of the workload.
False sharing misses: increase as the cache block size increases; they would not occur if the cache block size were one word.
The current trend is toward larger cache blocks, which potentially increases false sharing misses.
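A minimal sketch of the false sharing case, assuming a 64-byte block: the two threads never touch the same word, yet each write invalidates the other processor's copy of the shared block, so the other processor's next access misses.

```cpp
#include <cstdio>
#include <thread>

// Two counters that almost certainly land in the same 64 B cache block.
// Each write by one thread invalidates the block in the other thread's
// cache even though the threads use different words: false sharing.
struct Counters { long a = 0; long b = 0; } counters;

int main() {
    std::thread t1([] { for (int i = 0; i < 1'000'000; ++i) counters.a++; });
    std::thread t2([] { for (int i = 0; i < 1'000'000; ++i) counters.b++; });
    t1.join(); t2.join();
    std::printf("a=%ld b=%ld\n", counters.a, counters.b);
    return 0;
}
```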
Classification of Cache Misses
[Figure: miss-classification decision tree. For each miss the tree asks: Is this the first reference to the memory block by this processor? Is it the first access system-wide? Had the block been written before? What is the reason for the miss, i.e., why was the last copy eliminated (invalidation or replacement)? Is an old copy with state = invalid still in the cache? Has the block been modified since replacement? Were the modified word(s) accessed during the block's lifetime in the cache? The leaves give twelve categories: 1. cold, 2. cold, 3. false-sharing-cold, 4. true-sharing-cold, 5. false-sharing-inval-cap, 6. true-sharing-inval-cap, 7. pure-false-sharing, 8. pure-true-sharing, 9. pure-capacity, 10. true-sharing-capacity, 11. false-sharing-cap-inval, 12. true-sharing-cap-inval.]
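A much-simplified sketch of the top-level distinctions in the tree; the full twelve-way classification also tracks first accesses system-wide, invalidated-but-still-present copies, and replacement history. The struct fields and function name are illustrative assumptions:

```cpp
#include <cstdio>
#include <string>

// Per-miss facts a classifier would gather while simulating the protocol.
struct MissInfo {
    bool firstRefByThisProc;   // block never referenced by this processor
    bool lastCopyInvalidated;  // last copy was killed by an invalidation
    bool modifiedWordAccessed; // a remotely modified word is used this lifetime
};

std::string classify(const MissInfo& m) {
    if (m.firstRefByThisProc)   return "cold";
    if (!m.lastCopyInvalidated) return "capacity/conflict";
    return m.modifiedWordAccessed ? "true sharing" : "false sharing";
}

int main() {
    // Invalidated copy, but the modified word itself is never used.
    std::printf("%s\n", classify({false, true, false}).c_str()); // false sharing
    return 0;
}
```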
Impact of Block Size on Miss Rates (1 MB cache)
[Figure: miss rate (%) broken into cold, capacity, true sharing, false sharing, and upgrade components for block sizes of 8 to 256 bytes. Left panel: Barnes, LU, Radiosity (miss rates up to about 0.6%). Right panel: Ocean, Radix, Raytrace (miss rates up to about 12%). Configuration: 16 processors, 1 MB 4-way set-associative caches.]
Cold, capacity, and true sharing misses tend to decrease with increasing block size. False sharing misses tend to increase with block size.
Impact of Block Size on Miss Rates (64 KB cache)
Overall miss rates increase, and capacity misses are a much larger portion of overall misses.
Impact of Block Size on Bus Traffic (1 MB Cache)
Traffic affects performance indirectly, through contention.
[Figure: address-bus and data-bus traffic for block sizes of 8 to 256 bytes. Left panel: Barnes, Radiosity, Raytrace (bytes/instruction, up to about 0.18). Middle panel: Radix (bytes/instruction, up to about 10). Right panel: LU, Ocean (bytes/FLOP, up to about 1.8).]
Data traffic quickly increases with block size. Address-bus traffic tends to decrease with block size. Address traffic overhead comprises a significant fraction of total traffic for small block sizes.
Impact of Block Size on Bus Traffic (64 KB Cache)
For Ocean, data traffic increases slowly with block size (compare with the 1 MB cache).
Drawbacks of Large Cache Blocks
The trend toward larger cache block sizes is driven by the increasing density of processor and memory chips. This trend bodes poorly for multiprocessor designs because of the potential increase in false sharing misses.
Countering the Effects of Large Block Size
Organize data structures or work assignments so that data accessed by different processes is not interleaved finely in the shared address space (software approach; a sketch follows below).
Use sub-blocks within a cache block: one sub-block may be valid while others are invalid.
Use small cache blocks, but on a miss prefetch blocks beyond the accessed block.
Use an adjustable block size (complex).
Delay propagating or applying invalidations from a processor until it has issued multiple writes.
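For the software approach, a sketch of the usual fix for the false sharing example shown earlier: give each per-processor counter its own block (64 bytes assumed) so that writes by different processors never touch the same block.

```cpp
#include <cstdio>
#include <thread>

// Pad/align each counter to its own assumed 64 B cache block so the two
// threads' writes fall in different blocks and no false sharing occurs.
struct alignas(64) PaddedCounter { long value = 0; };

PaddedCounter counters[2];

int main() {
    std::thread t1([] { for (int i = 0; i < 1'000'000; ++i) counters[0].value++; });
    std::thread t2([] { for (int i = 0; i < 1'000'000; ++i) counters[1].value++; });
    t1.join(); t2.join();
    std::printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}
```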
Update-Based vs. Invalidation-Based Protocols
Update-based protocols perform better if the processors that were using the data before it was updated are likely to use the new values in the future.
Invalidation-based protocols perform better if those processors are never going to use the new values (since the update traffic is then useless).
Hybrid of Update and Invalidation (Mixed)
Start with an update protocol and associate a counter with each block, initialized to a threshold k.
Whenever a cache block is accessed by the local processor, the counter is reset to k.
Every time an update is received for the block, the counter is decremented.
If the counter reaches zero, the block is locally invalidated.
The next time an update would be generated, the writer finds no sharers, switches the block to the modified state, and stops generating updates.
If some other processor then accesses the block, the block switches back to the shared state and updates are generated again.
(A sketch of the per-block counter logic follows below.)
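A minimal sketch of the per-block counter logic for the mixed (competitive) scheme described above; the structure and function names are illustrative, not a specific machine's implementation.

```cpp
#include <cstdio>

constexpr int k = 4;                 // threshold

struct BlockState {
    bool valid = true;
    int counter = k;                 // starts at k
};

// Local processor touches the block: reset the counter to k.
void onLocalAccess(BlockState& b) { b.counter = k; }

// An update for this block arrives on the bus: count down, and invalidate
// the local copy after k consecutive updates with no intervening local
// access. Once every sharer has invalidated, the writer sees no sharers,
// switches the block to modified, and stops generating updates.
void onUpdateReceived(BlockState& b) {
    if (!b.valid) return;
    if (--b.counter == 0) b.valid = false;
}

int main() {
    BlockState b;
    for (int i = 0; i < k; ++i) onUpdateReceived(b);
    std::printf("valid after %d updates with no local access: %s\n",
                k, b.valid ? "yes" : "no");
    return 0;
}
```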
Update vs. Invalidate: Miss Rates
[Figure: miss rate (%) broken into cold, capacity, true sharing, and false sharing components. Left panel: LU and Ocean under inv/mix/upd (miss rates up to about 0.6%). Right panel: Raytrace and Radix under inv/mix/upd (up to about 2.5%). k = 4 for the mixed protocol.]
With lots of coherence misses, updates help. With lots of capacity misses, updates hurt (data is kept in the cache uselessly).
Update Protocols
For applications with significant capacity miss rates, misses increase with an update protocol. False sharing decreases with an update protocol. The traffic associated with updates is quite substantial (many bus transactions versus one in an invalidation protocol). The increased traffic can cause contention and can greatly increase the cost of misses. Update protocols have greater problems for scalable systems. The trend is away from update-based protocols as the default.