Caches and Prefetching
1 Caches and Prefetching
2 Acknowledgements Several of the slides in this deck are from Luis Ceze (Washington), Nima Honarmand (Stony Brook), Mark Hill, David Wood, Karu Sankaralingam (Wisconsin), and Abhishek Bhattacharjee (Rutgers). Development of this course is partially supported by Western Digital Corporation. 8/5/2018
3 Performance Motivation We want memory to appear: as fast as the CPU, and as large as required by all of the running applications.
4 Storage Hierarchy Make the common case fast: Common: temporal & spatial locality. Fast: smaller, more expensive memory. Registers → Caches (SRAM) → Memory (DRAM) → [SSD? (Flash)] → Disk (magnetic media). Registers and caches are controlled by hardware; memory and below are controlled by software (the OS).
5 Storage Hierarchy Moving down the hierarchy, each level is larger and cheaper, with bigger transfers; moving up, each level is faster, with more bandwidth.
7 Caches A cache is automatically managed by hardware. Break memory into blocks (typically 64 bytes) and transfer data to/from the cache in blocks, to exploit spatial locality. Keep recently accessed blocks, to exploit temporal locality. Both locality principles are typical program behavior.
8 Cache Terminology block (cache line): minimum unit that may be cached frame: cache storage location to hold one block hit: block is found in the cache miss: block is not found in the cache miss ratio: fraction of references that miss hit time: time to access the cache miss penalty: time to retrieve block on a miss
9 Cache Example Address sequence from core (assume 8-byte lines): 0x10000 (Miss: fetch the line into the cache), 0x10004 (Hit: same 8-byte line as 0x10000), 0x10120 (Miss: fetch the line). Final miss ratio is 50%.
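The hit/miss classification on this slide can be reproduced with a tiny simulation. This is a sketch only: the "cache" below is just a set of line addresses with no capacity limit or eviction, which is enough for a short trace.

```python
def simulate(addresses, line_size=8):
    """Classify each access as 'hit' or 'miss' for an idealized cache
    that keeps every line it has ever fetched (no evictions)."""
    cached_lines = set()
    results = []
    for addr in addresses:
        line = addr // line_size          # 8-byte lines: drop the low 3 bits
        if line in cached_lines:
            results.append("hit")
        else:
            results.append("miss")
            cached_lines.add(line)        # the whole line is filled on a miss
    return results

# 0x10000 and 0x10004 fall in the same 8-byte line; 0x10120 does not.
print(simulate([0x10000, 0x10004, 0x10120]))  # ['miss', 'hit', 'miss']
```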
17 Typical memory hierarchy L1 is usually split into separate I$ (instruction cache) and D$ (data cache); L2 and L3 are unified. Why a multi-level cache? Processor registers → I-TLB / L1 I-Cache and D-TLB / L1 D-Cache → L2 Cache → L3 Cache (LLC) → Main Memory (DRAM).
18 Average Memory Access Time AMAT = Hit-time + Miss-rate × Miss-penalty. If an L1 cache hit is 5 cycles (core to L1 and back), an L2 cache hit is 20 cycles (core to L2 and back), and a memory access is 100 cycles (L2 miss penalty), then at a 20% miss ratio in L1 and a 40% miss ratio in L2: avg. access = 5 + 0.2 × (0.6 × 20 + 0.4 × 100) = 5 + 0.2 × 52 = 15.4 cycles.
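The arithmetic above can be checked with a short script. A sketch, with one interpretation made explicit: the 100-cycle memory access is treated as the full cost of a reference that misses in L2, as on the slide.

```python
def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_access):
    """AMAT = hit time + miss rate * miss penalty.
    The L1 miss penalty is itself an average over L2 hits and L2 misses."""
    l1_miss_penalty = (1 - l2_miss_rate) * l2_hit + l2_miss_rate * mem_access
    return l1_hit + l1_miss_rate * l1_miss_penalty

print(amat(5, 0.20, 20, 0.40, 100))  # 15.4 cycles, matching the slide
```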
19 Typical memory hierarchy L1 and L2 are private per core; L3 is shared (why shared?). Each core has its own registers, I-TLB / L1 I-Cache, D-TLB / L1 D-Cache, and L2 cache; all cores share the L3 cache (LLC) and main memory (DRAM). [Later] Multi-core replicates the top of the hierarchy.
20 Intel Nehalem (3.3GHz, 4 cores, 2 threads per core) Memory hierarchy: 32K L1-I, 32K L1-D, 256K L2 per core.
21 How to Build a Cache
22 SRAM Overview Two chained inverters maintain a stable state (storing a 0 or a 1). Access gates provide access to the cell via the bitlines (b and b̄). Writing to the cell involves over-powering the storage inverters. This is the 6T SRAM cell: 2 access transistors plus 2 transistors per inverter.
27 8×8-bit SRAM Array A 3-to-8 (1-of-8) decoder asserts one wordline; the selected row's cells are read or written through the bitlines.
33 Direct-Mapped Cache using SRAM Address split: tag[50:16], index[15:6], block offset[5:0]. The decoder selects one frame (state, tag, data); only one tag comparison (=) is needed to determine a hit, and a multiplexor selects the requested word from the block. Why take the index bits out of the middle?
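The address split on this slide can be sketched directly in Python. The field widths are taken from the slide (6 offset bits for 64-byte blocks, 10 index bits for 1024 sets); the example address is arbitrary.

```python
# Field widths from the slide: offset[5:0], index[15:6], tag[50:16].
BLOCK_OFFSET_BITS = 6
INDEX_BITS = 10

def split_address(addr):
    """Split an address into (tag, index, block offset) fields."""
    offset = addr & ((1 << BLOCK_OFFSET_BITS) - 1)
    index = (addr >> BLOCK_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0xDEADBEEF)
print(hex(tag), hex(index), hex(offset))  # 0xdead 0x2fb 0x2f
```

On a lookup, `index` picks the frame, `tag` is compared against the stored tag, and `offset` selects the byte within the block.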
34 Improving Cache Performance Recall the AMAT formula: AMAT = Hit-time + Miss-rate × Miss-penalty. To improve cache performance, we can improve any of the three components. Let's start by reducing miss rate.
35 The 4 C's of Cache Misses Compulsory: never accessed before. Capacity: accessed long ago and already replaced because the cache is too small. Conflict: neither compulsory nor capacity; caused by limited associativity. Coherence: (will discuss in the multi-processor lectures).
36 Cache size Cache size is data capacity (don't count tag and state). Bigger caches can exploit temporal locality better, but bigger is not always better. Too large a cache: smaller is faster, bigger is slower, and the access time may hurt the critical path. Too small a cache: limited temporal locality, and useful data is constantly replaced. (Plot: hit rate vs. cache size; the knee sits near the working-set size.)
37 Block size Block size is the data associated with an address tag; it is not necessarily the unit of transfer between hierarchy levels. Too small a block doesn't exploit spatial locality well and has excessive tag overhead. Too large a block transfers useless data, and with too few total blocks, useful data is frequently replaced. Common block sizes are 32-128 bytes. (Plot: hit rate vs. block size.)
38 Cache Conflicts What if two blocks alias onto the same frame? Same index, but different tags. Address sequence: 0xDEADBEEF, 0xFEEDBEEF, 0xDEADBEEF (tag | index | block offset). The second access to 0xDEADBEEF experiences a Conflict miss: not Compulsory (we've seen it before), not Capacity (lots of other frames are available in the cache).
41 Associativity In a cache with 8 frames, where does block 12 (0b1100) go? Direct-mapped: the block goes in exactly one frame (1 frame per set). Set-associative: the block goes in any frame of one set (frames grouped into sets). Fully-associative: the block goes in any frame (all frames in 1 set).
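The three placement policies can be sketched with one function. An assumption is made for illustration: set s occupies consecutive frames s·ways .. s·ways+ways−1, which is one common way to draw the picture.

```python
def candidate_frames(block, num_frames, ways):
    """Frames in which a block may live, for a cache organized as
    (num_frames / ways) sets of `ways` frames each."""
    num_sets = num_frames // ways
    s = block % num_sets                      # set index from the block number
    return [s * ways + w for w in range(ways)]

# Block 12 (0b1100) in an 8-frame cache:
print(candidate_frames(12, 8, 1))  # direct-mapped: exactly one frame
print(candidate_frames(12, 8, 2))  # 2-way set-associative: one set of 2 frames
print(candidate_frames(12, 8, 8))  # fully-associative: any of the 8 frames
```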
45 Associativity Larger associativity (for the same size): lower miss rate (fewer conflicts), but higher power consumption. Smaller associativity: lower cost and faster hit time. 2:1 rule of thumb: for small caches (up to 128KB), 2-way associativity gives the same miss rate as a direct-mapped cache of twice the size (holding cache size and block size constant). (Plot: hit rate vs. L1-D associativity.)
46 N-Way Set-Associative Cache Address split: tag[50:16], index[15:6], block offset[5:0]. The index selects a set; each way of the set has its own (state, tag, data) entries, tag comparator (=), and multiplexor, and a final multiplexor selects the data from the hitting way. Note the additional bit(s) moved from index to tag as associativity grows.
49 Fully-Associative Cache Address split: tag[50:6], block offset[5:0]. Keep blocks in cache frames, each with data, state (e.g., valid), and an address tag. The incoming tag is compared (=) against every frame's tag in parallel; this structure is a Content Addressable Memory (CAM). A multiplexor then selects the hitting frame's data.
50 Cache block replacement algorithms Which block in a set should be replaced on a miss? Ideal replacement (Belady's Algorithm): replace the block accessed farthest in the future. How would you implement it? Least Recently Used (LRU): optimized for temporal locality (expensive for > 2-way associativity. Why?). Not Most Recently Used (NMRU): track the MRU block and select randomly among the rest; same as LRU for 2-way. Random: nearly as good as LRU, sometimes better (when?). Pseudo-LRU: used in caches with high associativity. Examples: Tree-PLRU, Bit-PLRU.
51 Tree-based Pseudo-LRU The idea is to ensure you do not replace a recently accessed block. Not guaranteed to replace the *least* recently used block; it is a best effort to replace the *least* recently used block. For 2-way set-associative, keep 1 bit: the bit indicates which of the two lines has been referenced more recently, and we replace the one that was not recently accessed. For 4-way set-associative, we need 3 bits.
52 Tree-based pseudo-LRU example Access stream A B C D E, 4-way set-associative. Three recency bits form a binary tree over the four ways (one root bit plus one bit per pair of ways); each access flips the bits on its path to point away from the accessed way. A, B, C, and D fill the four ways in turn, updating the recency vector after each access; when E arrives, following the recency bits selects A's way, so E replaces A.
59 Tree-based Pseudo-LRU The set now holds E, B, D, C. Which is the next victim? C. Is it true LRU? What is the advantage of pseudo-LRU? A few bit flips encode recency information on each access, and a bit-vector lookup determines the victim. The victim is not always the LRU block, but it is never the MRU block, and more often than not it is close to LRU.
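The slide's example can be run as code. One convention has to be assumed, since the slides don't fix one: a node bit of 0 means "the victim is in the left half", and each access flips the bits on its path to point away from the accessed way. Invalid ways are filled in order before pseudo-LRU kicks in.

```python
class TreePLRU4:
    """Tree pseudo-LRU for one 4-way set: 3 bits arranged as a binary tree."""
    def __init__(self):
        self.bits = [0, 0, 0]        # [root, left pair, right pair]
        self.ways = [None] * 4

    def _touch(self, way):
        # Point every node on the accessed path away from `way`.
        if way < 2:
            self.bits[0] = 1         # victim search now goes right
            self.bits[1] = 1 - way   # within the left pair, away from `way`
        else:
            self.bits[0] = 0
            self.bits[2] = 3 - way

    def victim_way(self):
        # Follow the bits down the tree.
        return self.bits[1] if self.bits[0] == 0 else 2 + self.bits[2]

    def access(self, block):
        """Touch on hit; fill or replace on miss. Returns the evicted block."""
        if block in self.ways:
            self._touch(self.ways.index(block))
            return None
        way = self.ways.index(None) if None in self.ways else self.victim_way()
        evicted = self.ways[way]
        self.ways[way] = block
        self._touch(way)
        return evicted

set_ = TreePLRU4()
evictions = [set_.access(b) for b in "ABCDE"]
print(evictions)                      # E evicts A, as on the slides
print(set_.ways[set_.victim_way()])   # next victim: C
```

Under this convention the run reproduces the slides: E replaces A, and the next victim is C even though B is the true LRU block, illustrating that PLRU is only an approximation.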
62 Cache design issues
63 1 Non-blocking caches Should later cache lookups stall if there is a miss? An earlier miss can induce a head-of-the-line stall. Observation 1: misses usually happen in bursts; it is helpful to overlap the latencies of multiple parallel misses. Observation 2: a hit can happen while a miss is pending (particularly true for OOO execution). Observation 3: the main memory system can support a large number of in-flight requests. Idea: let's make caches non-blocking, i.e., the cache keeps accepting new requests while waiting for misses to be handled (how many new requests?). More parallelism in the cache subsystem is important for superscalar and OOO cores.
67 Implementing Non-blocking Caches On a miss: send the request to the next level of cache/memory, and put the miss information in a Miss Status Holding Register (MSHR): instruction tag (ROB#), address, load-or-store, store value, ... (Why?) Tag the miss request to the later cache/memory with the MSHR entry ID. When the memory response arrives: find the corresponding MSHR entry using the MSHR tag; merge the memory response data with the store value (if a store miss) and write it to the cache; broadcast the result on the CDB (if a load miss).
70 Implementing Non-blocking Caches If a new load/store request arrives for an already-missing line, merge the new miss into the existing MSHR instead of sending another request to the next cache level/main memory; MSHR entries should be big enough to keep info for multiple pending misses to the same line. How many MSHRs? Typically 10s; it depends on the targeted bandwidth and the average latency of each miss (remember Little's law from the first lecture?). E.g., 11 at the L1 level in current Intel Xeon (server) processors.
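Little's law gives a quick sizing estimate for the MSHR count. This is a sketch with illustrative numbers, not a derivation of any particular processor's design: in-flight requests = arrival rate × residence time.

```python
import math

def mshrs_needed(misses, per_cycles, avg_miss_latency):
    """Little's law: concurrent misses = miss rate * average miss latency.
    `misses` per `per_cycles` cycles; round up to whole MSHRs."""
    return math.ceil(misses * avg_miss_latency / per_cycles)

# To sustain one outstanding miss every 10 cycles with a 100-cycle
# average miss latency, about 10 misses are in flight at once:
print(mshrs_needed(1, 10, 100))  # 10
```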
74 2 Parallel vs. Serial Caches Tag and data are usually separate SRAMs; the tag array is smaller & faster. State bits (valid bit, LRU bit(s), ...) are stored along with the tags. Parallel access to tag and data reduces latency (good for L1). Serial access (compare tags first, then enable only the hitting data way) reduces power (good for L2+).
75 3 Victim caches Associativity is expensive: performance overhead from extra muxes, and power overhead from reading and checking more tags and data. Conflicts are also expensive: performance is lost to extra misses. Observation: conflicts don't occur in all sets. Idea: use a small fully-associative victim cache to absorb blocks displaced from the main cache, extending the associativity of sets experiencing many conflict misses.
76 Victim cache Access sequence: E A B N M C K L D J. In a 4-way set-associative L1 alone, every access is a miss: the five blocks A-E map to one set and the five blocks J-N to another, and neither group fits in a 4-way set. Adding a small fully-associative victim cache provides a fifth way, so long as only four sets overflow into it at the same time (it can even provide a 6th or 7th way). It provides extra associativity, but not for all sets.
78 4 Cache inclusivity The core often accesses blocks not present in any cache. Should the block be allocated in L3, L2, and L1? Inclusive caches: yes. This wastes space and requires forced evictions (e.g., force an evict from L1 on an evict from L2+). Non-inclusive caches: only allocate the block in L1 (why not call them exclusive?). Some processors combine both: L3 is inclusive of L1 and L2 (why do inclusive caches?), while L2 is non-inclusive of L1 (like a large victim cache).
79 5 Write propagation When should a new value be propagated to (lower-level) memory? Option #1, write-through: immediately. On a hit, update the cache and immediately send the write to the next level. Option #2, write-back: when the block is replaced/evicted. Requires an additional dirty bit per block: replacing a clean block causes no extra traffic; replacing a dirty block requires an extra writeback of the block. What are the trade-offs?
80 Write propagation comparison Write-through: requires additional bus bandwidth (consider repeated write hits), and the next level must handle small writes (1, 2, 4, 8 bytes). + No need for dirty bits in the cache. + No need to handle writeback operations, which simplifies miss handling (no WBB). Sometimes used for L1 caches (for example, by IBM). Write-back: + key advantage: uses less bandwidth (the reverse of the pros/cons above). Used by Intel and AMD. Second-level caches and beyond are generally write-back caches.
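The "repeated write hits" point can be made concrete with a toy traffic model. This is a sketch under stated assumptions (8-byte stores, a 64-byte block, all writes hitting the same cached block), not a model of any real bus.

```python
def write_through_traffic(writes, word_bytes=8):
    """Write-through: every write hit is also sent to the next level."""
    return writes * word_bytes

def write_back_traffic(writes, block_bytes=64):
    """Write-back: repeated hits just dirty the block; one block writeback
    happens when the block is eventually evicted."""
    return block_bytes if writes > 0 else 0

# 1000 8-byte write hits to the same cached block:
print(write_through_traffic(1000))  # 8000 bytes sent to the next level
print(write_back_traffic(1000))     # 64 bytes (a single writeback)
```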
81 6 Write/store miss handling Should we bring the data into the cache on a write miss? (For the L1 cache, store == write.)
82 Write miss handling How is a write miss actually handled? Write-allocate: fill the block from the next level, then write it. + Decreases read misses (the next read to the block will hit). Requires additional bandwidth. Commonly used (especially with write-back caches). Write-non-allocate: just write to the next level, with no allocation. Potentially more read misses. + Uses less bandwidth. Used with write-through.
83 Write miss latency On a read miss, the load can't go on without the data; it must stall. On a write miss, what happens when the store instruction reaches the head of the ROB and misses in the cache? It stalls retirement of later instructions. What can we do about it? Remember that stores can be the end of a dependence chain of instructions: is the store latency even on the critical path?
85 Write Misses and Write Buffers On a read miss, the load can't go on without the data; it must stall. On a write miss, technically, if no instruction is waiting for the data, why stall? Write buffer (a.k.a. store buffer): a small buffer. How does it help? Write buffer vs. writeback buffer: the write buffer (WB) sits in front of the L1 D$, for hiding store misses; the writeback buffer (WBB) sits behind the L1 D$, for hiding writebacks. (Processor → WB → Cache → WBB → next-level cache.) Any possible issues? Forwarding again??
87 7 Local vs. Global Miss Rates Local hit/miss rate: the percentage of references to this cache that hit (e.g., 90%); the local miss rate is 100% minus the local hit rate (e.g., 10%). Global hit/miss rate: misses per instruction (e.g., 1 miss per 30 instructions) or instructions per miss (e.g., 3% of instructions miss, assuming loads/stores are 1 in 3 instructions). Consider the second-level cache hit rate: L1: 2 misses per 100 instructions; L2: 1 miss per 100 instructions. L2 local miss rate → 50%.
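The local-vs-global arithmetic can be checked directly. A minimal sketch: since L2 is only accessed on L1 misses, its local miss rate is L2 misses divided by L1 misses.

```python
l1_misses_per_1000_inst = 20   # "2 misses per 100 instructions"
l2_misses_per_1000_inst = 10   # "1 miss per 100 instructions"

# L2 accesses = L1 misses, so:
# local L2 miss rate = (L2 misses) / (L1 misses)
l2_local_miss_rate = l2_misses_per_1000_inst / l1_misses_per_1000_inst
print(l2_local_miss_rate)  # 0.5, i.e. the 50% on the slide
```

Note how a modest-looking global rate (1 miss per 100 instructions) can correspond to a poor local rate (50%), which is why both metrics matter.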
89 Techniques We've Seen So Far Use caching to reduce memory latency. Use wide out-of-order execution to hide memory latency by overlapping misses with other useful work; but we cannot efficiently go much wider than several instructions. Neither is enough for server applications: not much locality (mostly accessing linked data structures), not much ILP and MLP, and server apps spend 50-66% of their time stalled on memory. We need a different strategy.
91 Prefetching
92 Prefetching (1) Fetch data ahead of demand. Why could it be useful? Can any caching technique avoid compulsory misses? Caching does not tell us what could be needed next; a replacement policy only predicts what in the cache may no longer be required. Main challenges: knowing what to fetch (fetching useless blocks wastes resources), and knowing when to fetch (too early pollutes the cache, or the block gets thrown out before use; too late defeats the purpose of *pre*-fetching). Prefetching must be accurate and timely.
96 Prefetching (2) Without prefetching: a load must walk down L1 → L2 → DRAM before the data comes back, so the total load-to-use latency is the full miss latency. With prefetching: a prefetch issued well before the load brings the block in early, the later load hits, and the load-to-use latency is much improved. Or, if the prefetch is issued late, the load overlaps with the in-flight prefetch and the latency is only somewhat improved. Prefetching must be accurate and timely.
114 Types of Prefetching Software: by the compiler, by the programmer. Hardware: Next-Line/Adjacent-Line, Next-N-Line, Stream Buffers, Stride, Localized (PC-based), Pointer, Correlation.
117 Software Prefetching (1) Prefetch data using explicit instructions, inserted by the compiler and/or programmer. Put the prefetched value into: a register (binding prefetch), also called hoisting (the same as reordering instructions; basically just moving the load instruction up in the program); or the cache (non-binding prefetch), which requires ISA support, and the block may get evicted from the cache before the demand access.
121 Software Prefetching (2) Consider basic block B containing R1 = [R2], followed by R3 = R1 + 4 in block C. Hoisting is prone to many problems: we must be aware of dependences (hoisting R1 = [R2] above block A's R1 = R1 - 1 violates the WAW dependence on R1); the hoisted load must not cause exceptions that were not possible in the original execution; and it increases register pressure for the compiler. Using a prefetch instruction instead (PREFETCH [R2] placed in block A) avoids all these problems.
125 Software Prefetching (3)

for (i = 1; i < rows; i++) {
  for (j = 1; j < columns; j++) {
    prefetch(&x[i+1][j]);
    sum = sum + x[i][j];
  }
}

The prefetch instruction reads the containing block from memory and puts it in the cache.
126 Software Prefetching (4) Pros: gives the programmer control and flexibility; allows for complex (compiler) analysis; no (major) hardware modifications needed. Cons: prefetch instructions increase the code footprint, which may cause more I$ misses and code-alignment issues; it is hard to perform timely prefetches (at IPC=2 and a 100-cycle memory latency, the load would have to move 200 instructions earlier, and there might not even be 200 instructions in the current function); and prefetching earlier and more often leads to low accuracy, since the program may go down a different path (block B in the previous slides).
128 Hardware Prefetching Hardware monitors memory accesses, looks for common patterns, and makes predictions. Predicted addresses are placed into a prefetch queue, which is checked when no demand accesses are waiting; prefetches look like load requests to the memory hierarchy. Prefetchers trade bandwidth for latency: extra bandwidth is used only when guessing incorrectly, and latency is reduced only when guessing correctly. No need to change software.
131 Hardware Prefetcher Design Space What to prefetch? Predict regular patterns (x, x+8, x+16, ...) or correlated patterns (A..B→C, B..C→J, A..C→K, ...). When to prefetch? On every reference (lots of lookup/prefetch overhead); on every miss (patterns filtered by the caches); on hits to prefetched data (positive feedback). Where to put prefetched data? Prefetch buffers, or the caches themselves.
134 Prefetching at Different Levels (Figure: Intel Core2 prefetcher locations in the hierarchy: registers, I-TLB / L1 I-Cache, D-TLB / L1 D-Cache, L2, L3/LLC.) Real CPUs have multiple prefetchers with different strategies, usually closer to the core (it is easier to detect patterns there). Prefetching at the LLC is hard. Why? Access patterns from many cores are mixed in a shared LLC, and the LLC is typically heavily banked (why is this a problem?).
136 Next-Line (or Adjacent-Line) Prefetching On a request for line X, prefetch X+1. Assumes spatial locality (why not just increase the block size?). Should stop at physical (OS) page boundaries (why?). Works for both I$ and D$: instructions execute sequentially, and large data structures often span multiple blocks. Simple, but usually not timely.
139 Next-N-Line Prefetching On request for line X, prefetch X+1, X+2, ..., X+N N is called prefetch depth or prefetch degree
140 Next-N-Line Prefetching On request for line X, prefetch X+1, X+2, ..., X+N N is called prefetch depth or prefetch degree Must carefully tune depth N. Large N: more likely to be timely (always?) More aggressive -> more likely to make a mistake Might evict something useful More expensive -> need storage for prefetched lines Might delay useful requests on interconnect or port
141 Next-N-Line Prefetching On request for line X, prefetch X+1, X+2, ..., X+N N is called prefetch depth or prefetch degree Must carefully tune depth N. Large N: more likely to be timely (always?) More aggressive -> more likely to make a mistake Might evict something useful More expensive -> need storage for prefetched lines Might delay useful requests on interconnect or port Still simple, but more timely than Next-Line
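The next-N-line policy above (with its page-boundary stop) can be sketched in a few lines; the 64-byte line and 4 KiB page sizes are the common defaults, and all names are illustrative:

```python
# Hypothetical sketch of next-N-line prefetch address generation.
LINE_SIZE = 64
PAGE_SIZE = 4096

def next_n_line_prefetches(miss_addr, depth):
    """Return addresses of the next `depth` lines after the missing one,
    clipped so we never cross into the next physical page (whose mapping
    we cannot know without a TLB lookup)."""
    line_base = miss_addr & ~(LINE_SIZE - 1)
    page_end = (miss_addr & ~(PAGE_SIZE - 1)) + PAGE_SIZE
    candidates = []
    for i in range(1, depth + 1):
        addr = line_base + i * LINE_SIZE
        if addr >= page_end:   # stop at the physical page boundary
            break
        candidates.append(addr)
    return candidates
```

Note how a miss near the end of a page yields fewer than N candidates, which is exactly the "should stop at page boundaries" point on the slide.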
142 Stride Prefetching (1) Column in matrix Elements in array of structs Access patterns often follow a stride Example 1: Accessing column of elements in a matrix Example 2: Accessing elements in array of structs
143 Stride Prefetching (1) Column in matrix Elements in array of structs Access patterns often follow a stride Example 1: Accessing column of elements in a matrix Example 2: Accessing elements in array of structs Detect stride S, prefetch depth N Prefetch X+S, X+2S, ..., X+NS
144 Stride Prefetching (2) Must carefully select depth N Same constraints as Next-N-Line prefetcher How to tell the difference between A[i] -> A[i+1] (a stride) and X -> Y (unrelated accesses)? Wait until you see the same stride a few times Can vary prefetch depth based on confidence More consecutive strided accesses -> higher confidence [diagram: new access to A+3S matches table entry (Last Addr A+2S, Stride S, Count 2); count > 2, so prefetch A+4S and update the entry]
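The confidence scheme just described can be sketched as a single table entry; the threshold and depth values are illustrative, not taken from any real design:

```python
# Toy stride-detector entry: track last address, stride, and a
# confidence count; prefetch only once the same stride repeats.
class StrideEntry:
    def __init__(self, threshold=2, depth=1):
        self.last_addr = None
        self.stride = None
        self.count = 0
        self.threshold = threshold
        self.depth = depth

    def access(self, addr):
        """Update on a new access; return prefetch addresses (possibly empty)."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.stride:
                self.count += 1      # same stride again: grow confidence
            else:
                self.stride = stride # new stride: restart training
                self.count = 1
            if self.count > self.threshold and self.stride != 0:
                prefetches = [addr + i * self.stride
                              for i in range(1, self.depth + 1)]
        self.last_addr = addr
        return prefetches
```

With threshold 2, the entry stays silent during training and starts issuing prefetches only after the stride has repeated enough times.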
145 Localized Stride Prefetchers (1) What if multiple strides are interleaved? Y = A + X? Load R1 = [R2] Load R3 = [R4] Add R5, R1, R3 Store [R6] = R5 A, X, Y, A+S0, X+S1, Y+S2, A+2S0, X+2S1, Y+2S2,
146 Localized Stride Prefetchers (1) What if multiple strides are interleaved? No clearly-discernible stride Y = A + X? Load R1 = [R2] Load R3 = [R4] Add R5, R1, R3 Store [R6] = R5 A, X, Y, A+S0, X+S1, Y+S2, A+2S0, X+2S1, Y+2S2, (observed global strides (X-A), (Y-X), (A+S-Y) repeat with no single stride)
147 Localized Stride Prefetchers (1) What if multiple strides are interleaved? No clearly-discernible stride Y = A + X? Load R1 = [R2] Load R3 = [R4] Add R5, R1, R3 Store [R6] = R5 A, X, Y, A+S0, X+S1, Y+S2, A+2S0, X+2S1, Y+2S2, (observed global strides (X-A), (Y-X), (A+S-Y) repeat with no single stride) Observation: Accesses to structures usually correlate with an instruction PC
148 Localized Stride Prefetchers (1) What if multiple strides are interleaved? No clearly-discernible stride Y = A + X? Load R1 = [R2] Load R3 = [R4] Add R5, R1, R3 Store [R6] = R5 A, X, Y, A+S0, X+S1, Y+S2, A+2S0, X+2S1, Y+2S2, (observed global strides (X-A), (Y-X), (A+S-Y) repeat with no single stride) Observation: Accesses to structures usually correlate with an instruction PC Idea: Use an array of strides, indexed by PC
149 Localized Stride Prefetchers (2) Store PC, last address, last stride, and count in a Reference Prediction Table (RPT) On access, check RPT Same stride? count++ if yes, count-- or count=0 if no If count is high, prefetch (last address + stride) [diagram: loads at PCs 0x409A34, 0x409A38, 0x409A40 index RPT entries (Tag 0x409..., Last Addr A+3S0, Stride S0, Count 2), (X+3S1, S1, 2), (Y+2S2, S2, 1); if confident about the stride (count > C_min), prefetch A+4S0]
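A minimal model of the RPT described above, keyed by load PC; the dictionary stands in for the tagged hardware table, and the confidence threshold is illustrative:

```python
# Sketch of a PC-indexed Reference Prediction Table.
def make_rpt():
    return {}

def rpt_access(rpt, pc, addr, threshold=1):
    """Update the entry for this PC; return a prefetch address or None."""
    entry = rpt.get(pc)
    if entry is None:
        rpt[pc] = {"last": addr, "stride": 0, "count": 0}
        return None
    stride = addr - entry["last"]
    if stride == entry["stride"]:
        entry["count"] += 1          # same stride: grow confidence
    else:
        entry["stride"] = stride     # stride changed: reset confidence
        entry["count"] = 0
    entry["last"] = addr
    if entry["count"] > threshold and stride != 0:
        return addr + stride
    return None
```

Because each load PC gets its own entry, interleaved streams from different instructions no longer corrupt each other's stride training.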
150 Other Prefetch Patterns Sometimes accesses are highly predictable, but no strides Linked data structures (e.g., lists or trees) A B C D E F Linked-list traversal
151 Other Prefetch Patterns Sometimes accesses are highly predictable, but no strides Linked data structures (e.g., lists or trees) A B C D E F Linked-list traversal [diagram: actual memory layout D A F B C E (no chance to detect a stride)]
152 Pointer Prefetching (1) Data filled on cache miss (512 bits of data) Which ones are pointers? Nope Nope Nope Nope Maybe! Maybe! Nope Nope Pointers usually look different
153 Pointer Prefetching (1) Data filled on cache miss (512 bits of data) Which ones are pointers? Nope Nope Nope Nope Maybe! Maybe! Nope Nope Go ahead and prefetch these (needs some help from the TLB) Pointers usually look different
154 Pointer Prefetching (1) Data filled on cache miss (512 bits of data) Which ones are pointers? Nope Nope Nope Nope Maybe! Maybe! Nope Nope struct bintree_node_t { int data1; int data2; struct bintree_node_t * left; struct bintree_node_t * right; }; Go ahead and prefetch these (needs some help from the TLB) This allows you to walk the tree (or other pointer-based data structures which are typically hard to prefetch) Pointers usually look different
155 Pointer Prefetching (2) Relatively cheap to implement Don't need extra hardware to store patterns But can fetch a lot of junk Limited lookahead makes timely prefetches hard Can't get next pointer until fetched data block What is the prefetch depth? [timeline: a stride prefetcher can issue X, X+S, X+2S with overlapping access latencies; a pointer prefetcher must wait for each of A, B, C to return before issuing the next]
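The "looks like a pointer" test from the previous slides can be sketched as a scan over a freshly filled line. The heap bounds here are purely illustrative assumptions; a real design would need TLB help to decide which values are plausible addresses:

```python
# Hedged sketch of content-directed pointer detection on a 64-byte line.
import struct

HEAP_LO, HEAP_HI = 0x1000_0000, 0x2000_0000   # assumed heap address range

def pointer_candidates(line_bytes):
    """Return 8-byte word values in the line that fall inside the
    assumed heap range, i.e. the 'Maybe!' candidates to prefetch."""
    assert len(line_bytes) == 64
    words = struct.unpack("<8Q", line_bytes)
    return [w for w in words if HEAP_LO <= w < HEAP_HI]
```

Small integers in the line (the `data1`/`data2` fields of the tree node) fall outside the range and are rejected, while the `left`/`right` fields survive the filter.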
156 Pair-wise Temporal Correlation (1) Accesses exhibit temporal correlation If E followed D in the past, then if we see D, prefetch E Somewhat similar to history-based branch prediction [diagram: a correlation table built from a linked-list traversal (A B C D E F) over the actual memory layout] Can use recursively to get more lookahead
157 Pair-wise Temporal Correlation (2) Many patterns are more complex than linked lists Can be represented by a Markov Model Requires tracking multiple potential successors Number of candidates is called breadth [diagram: Markov model with transition probabilities and the corresponding correlation table] Recursive lookahead: breadth & depth grow exponentially
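A small model of such a correlation table with bounded breadth; the structure and names are illustrative. On each access it records the previous-to-current transition and predicts the most frequent successors of the current address:

```python
# Sketch of a pair-wise temporal correlation table.
from collections import defaultdict, Counter

class CorrelationTable:
    def __init__(self, breadth=2):
        self.succ = defaultdict(Counter)   # addr -> counts of successors
        self.breadth = breadth             # max candidates to predict
        self.prev = None

    def access(self, addr):
        """Record the (prev -> addr) transition, then return up to
        `breadth` predicted successors of `addr`, by frequency."""
        if self.prev is not None:
            self.succ[self.prev][addr] += 1
        self.prev = addr
        return [a for a, _ in self.succ[addr].most_common(self.breadth)]
```

After one pass over a linked-list traversal, revisiting the list head already predicts the next node even though the nodes sit at unrelated addresses.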
158 Increasing Correlation History Length Like branch prediction, longer history can provide more accuracy (i.e., looking for a sequence of accesses) but increases training time Use a history hash for lookup E.g., XOR the bits of the addresses of the last K accesses Tree traversal example: A B D B E B A C F C G C A [diagram: tree and its traversal order] Better accuracy, larger storage cost
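The XOR history hash can be sketched in a few lines; K and the addresses are illustrative:

```python
# Tiny sketch of the history-hash lookup: fold the last K access
# addresses into a single correlation-table key.
from collections import deque

K = 3

def history_key(history):
    """XOR a sequence of addresses into one lookup key."""
    key = 0
    for a in history:
        key ^= a
    return key

window = deque(maxlen=K)          # sliding window of recent accesses
for addr in (0x100, 0x140, 0x180):
    window.append(addr)
key = history_key(window)         # 0x100 ^ 0x140 ^ 0x180 == 0x1C0
```

One caveat of a plain XOR: it is order-insensitive, so a real design might rotate or shift each address by its position before XORing so that different access orders hash to different keys.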
159 Stream Buffers (1) Used to avoid cache pollution caused by deep prefetching Each FIFO buffer holds one stream of sequentially prefetched lines Keep next-N available in buffer On a load miss, check the head of all buffers if match, pop the entry from the FIFO, fetch the N+1st line into the buffer if miss, allocate a new stream buffer (use LRU for recycling) [diagram: several FIFO stream buffers sit between the cache and the memory interface]
160 Stream Buffers (2) Can incorporate stride prediction mechanisms to support non-unit-stride streams Can extend to quasi-sequential stream buffer On request Y in [X, X+N], advance by Y-X+1 Allows buffer to work when items are skipped Requires expensive (associative) comparison
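A toy model of the stream-buffer lookup described above. For simplicity it recycles the oldest-allocated buffer rather than true LRU, and the sizes are illustrative:

```python
# Simplified stream-buffer model: on a load miss, check the head of
# each FIFO; on a head hit, pop it and refill the tail; otherwise
# allocate a new stream of the next `depth` lines.
from collections import deque

LINE = 64

class StreamBuffers:
    def __init__(self, n_buffers=4, depth=4):
        self.depth = depth
        # deque of per-stream FIFOs; oldest stream is evicted on overflow
        self.buffers = deque(maxlen=n_buffers)

    def lookup(self, miss_addr):
        """Return True on a stream-buffer head hit, else allocate a stream."""
        line = miss_addr & ~(LINE - 1)
        for buf in self.buffers:
            if buf and buf[0] == line:
                buf.popleft()                 # supply this line to the cache
                # refill the tail to keep next-N lines available
                buf.append(buf[-1] + LINE if buf
                           else line + self.depth * LINE)
                return True
        # missed in all buffers: start a new sequential stream
        self.buffers.append(deque(line + i * LINE
                                  for i in range(1, self.depth + 1)))
        return False
```

Because prefetched lines sit in the buffers rather than the cache, a wrong stream pollutes only one small FIFO, which is the pollution-avoidance argument on the slide.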
161 Example: Prefetchers in Intel Sandy Bridge Data L1 1) PC-localized stride prefetcher 2) Next-line prefetcher Only on an ascending access to very recently loaded data L2 3) Stream buffer/prefetcher: Prefetch the cache line that pairs with the current one to make a 128-byte aligned chunk 4) Stream prefetcher: detects streams of requests made by L1 (I and D) and prefetches lines down the stream # of lines to prefetch depends on # of outstanding requests from L1 Far lines are only prefetched to L3; closer ones are brought to L2 See the Intel Architectures Optimization Reference Manual for more details
162 Metrics for prefetching Coverage = (# of $ misses eliminated by prefetching) / (# of total $ misses) # of total $ misses = # of misses eliminated by prefetching + # of misses not eliminated by prefetching Accuracy = (# of misses eliminated by prefetching) / (# of useless prefetches + # of misses eliminated by prefetching) Timeliness, a qualitative metric
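The two quantitative metrics compute directly from miss and prefetch counts; the example numbers below are made up for illustration:

```python
# Coverage and accuracy, exactly as defined above.
def coverage(eliminated, not_eliminated):
    """Fraction of the original misses that prefetching removed."""
    return eliminated / (eliminated + not_eliminated)

def accuracy(eliminated, useless_prefetches):
    """Fraction of issued prefetches that were actually useful."""
    return eliminated / (eliminated + useless_prefetches)

# e.g. 60 misses eliminated, 40 misses remaining, 20 useless prefetches:
# coverage = 60/100 = 0.6, accuracy = 60/80 = 0.75
```

Note the tension the metrics expose: a more aggressive prefetcher usually raises coverage while lowering accuracy, and neither captures timeliness.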
More informationFall 2011 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic
Fall 2011 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic Reading: Data prefetch mechanisms, Steven P. Vanderwiel, David J. Lilja, ACM Computing Surveys, Vol. 32, Issue 2 (June 2000) If memory
More informationAgenda. EE 260: Introduction to Digital Design Memory. Naive Register File. Agenda. Memory Arrays: SRAM. Memory Arrays: Register File
EE 260: Introduction to Digital Design Technology Yao Zheng Department of Electrical Engineering University of Hawaiʻi at Mānoa 2 Technology Naive Register File Write Read clk Decoder Read Write 3 4 Arrays:
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More information5 Solutions. Solution a. no solution provided. b. no solution provided
5 Solutions Solution 5.1 5.1.1 5.1.2 5.1.3 5.1.4 5.1.5 5.1.6 S2 Chapter 5 Solutions Solution 5.2 5.2.1 4 5.2.2 a. I, J b. B[I][0] 5.2.3 a. A[I][J] b. A[J][I] 5.2.4 a. 3596 = 8 800/4 2 8 8/4 + 8000/4 b.
More informationMemory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology
Memory Hierarchies Instructor: Dmitri A. Gusev Fall 2007 CS 502: Computers and Communications Technology Lecture 10, October 8, 2007 Memories SRAM: value is stored on a pair of inverting gates very fast
More informationMemory Hierarchy. 2/18/2016 CS 152 Sec6on 5 Colin Schmidt
Memory Hierarchy 2/18/2016 CS 152 Sec6on 5 Colin Schmidt Agenda Review Memory Hierarchy Lab 2 Ques6ons Return Quiz 1 Latencies Comparison Numbers L1 Cache 0.5 ns L2 Cache 7 ns 14x L1 cache Main Memory
More informationCaching Basics. Memory Hierarchies
Caching Basics CS448 1 Memory Hierarchies Takes advantage of locality of reference principle Most programs do not access all code and data uniformly, but repeat for certain data choices spatial nearby
More informationEITF20: Computer Architecture Part4.1.1: Cache - 2
EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More information