Caches and Prefetching
1 Caches and Prefetching
2 Acknowledgements Several of the slides in this deck are from Luis Ceze (Washington), Nima Honarmand (Stony Brook), Mark Hill, David Wood, Karu Sankaralingam (Wisconsin), and Abhishek Bhattacharjee (Rutgers). Development of this course is partially supported by Western Digital Corporation. 8/5/2018
3 Performance Motivation We want memory to appear: as fast as the CPU, and as large as required by all of the running applications.
4 Storage Hierarchy Make the common case fast: Common: temporal & spatial locality. Fast: smaller, more expensive memory. Registers → Caches (SRAM) → Memory (DRAM) → [SSD? (Flash)] → Disk (magnetic media). Registers and caches are controlled by hardware; memory and below are controlled by software (the OS).
5 Storage Hierarchy Moving down the hierarchy, each level is larger and cheaper, with bigger transfers; moving up, each level is faster, with more bandwidth.
7 Caches A cache is automatically managed by hardware. Break memory into blocks (typically 64 bytes) and transfer data to/from the cache in blocks, to exploit spatial locality. Keep recently accessed blocks, to exploit temporal locality. Both locality principles are typical program behavior.
8 Cache Terminology block (cache line): minimum unit that may be cached frame: cache storage location to hold one block hit: block is found in the cache miss: block is not found in the cache miss ratio: fraction of references that miss hit time: time to access the cache miss penalty: time to retrieve block on a miss
9 Cache Example Address sequence from core (assume 8-byte lines): 0x10000 (Miss: fetch the line into the cache), 0x10004 (Hit: same 8-byte line as 0x10000), 0x10120 (Miss: fetch the line). Final miss ratio is 50%.
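The hit/miss classification on this slide can be reproduced with a tiny simulation. This is a sketch only: the "cache" below is just a set of line addresses with no capacity limit or eviction, which is enough for a short trace.

```python
def simulate(addresses, line_size=8):
    """Classify each access as 'hit' or 'miss' for an idealized cache
    that keeps every line it has ever fetched (no evictions)."""
    cached_lines = set()
    results = []
    for addr in addresses:
        line = addr // line_size          # 8-byte lines: drop the low 3 bits
        if line in cached_lines:
            results.append("hit")
        else:
            results.append("miss")
            cached_lines.add(line)        # the whole line is filled on a miss
    return results

# 0x10000 and 0x10004 fall in the same 8-byte line; 0x10120 does not.
print(simulate([0x10000, 0x10004, 0x10120]))  # ['miss', 'hit', 'miss']
```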
17 Typical memory hierarchy L1 is usually split into separate I$ (instruction cache) and D$ (data cache); L2 and L3 are unified. Why a multi-level cache? Processor registers → I-TLB / L1 I-Cache and D-TLB / L1 D-Cache → L2 Cache → L3 Cache (LLC) → Main Memory (DRAM).
18 Average Memory Access Time AMAT = Hit-time + Miss-rate × Miss-penalty. If an L1 cache hit is 5 cycles (core to L1 and back), an L2 cache hit is 20 cycles (core to L2 and back), and a memory access is 100 cycles (L2 miss penalty), then at a 20% miss ratio in L1 and a 40% miss ratio in L2: avg. access = 5 + 0.2 × (0.6 × 20 + 0.4 × 100) = 5 + 0.2 × 52 = 15.4 cycles.
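The arithmetic above can be checked with a short script. A sketch, with one interpretation made explicit: the 100-cycle memory access is treated as the full cost of a reference that misses in L2, as on the slide.

```python
def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_access):
    """AMAT = hit time + miss rate * miss penalty.
    The L1 miss penalty is itself an average over L2 hits and L2 misses."""
    l1_miss_penalty = (1 - l2_miss_rate) * l2_hit + l2_miss_rate * mem_access
    return l1_hit + l1_miss_rate * l1_miss_penalty

print(amat(5, 0.20, 20, 0.40, 100))  # 15.4 cycles, matching the slide
```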
19 Typical memory hierarchy L1 and L2 are private per core; L3 is shared (why shared?). Each core has its own registers, I-TLB / L1 I-Cache, D-TLB / L1 D-Cache, and L2 cache; all cores share the L3 cache (LLC) and main memory (DRAM). [Later] Multi-core replicates the top of the hierarchy.
20 Intel Nehalem (3.3GHz, 4 cores, 2 threads per core) Memory hierarchy: 32K L1-I, 32K L1-D, 256K L2 per core.
21 How to Build a Cache
22 SRAM Overview Two chained inverters maintain a stable state (storing a 0 or a 1). Access gates provide access to the cell via the bitlines (b and b̄). Writing to the cell involves over-powering the storage inverters. This is the 6T SRAM cell: 2 access transistors plus 2 transistors per inverter.
27 8×8-bit SRAM Array A 3-to-8 (1-of-8) decoder asserts one wordline; the selected row's cells are read or written through the bitlines.
33 Direct-Mapped Cache using SRAM Address split: tag[50:16], index[15:6], block offset[5:0]. The decoder selects one frame (state, tag, data); only one tag comparison (=) is needed to determine a hit, and a multiplexor selects the requested word from the block. Why take the index bits out of the middle?
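The address split on this slide can be sketched directly in Python. The field widths are taken from the slide (6 offset bits for 64-byte blocks, 10 index bits for 1024 sets); the example address is arbitrary.

```python
# Field widths from the slide: offset[5:0], index[15:6], tag[50:16].
BLOCK_OFFSET_BITS = 6
INDEX_BITS = 10

def split_address(addr):
    """Split an address into (tag, index, block offset) fields."""
    offset = addr & ((1 << BLOCK_OFFSET_BITS) - 1)
    index = (addr >> BLOCK_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0xDEADBEEF)
print(hex(tag), hex(index), hex(offset))  # 0xdead 0x2fb 0x2f
```

On a lookup, `index` picks the frame, `tag` is compared against the stored tag, and `offset` selects the byte within the block.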
34 Improving Cache Performance Recall the AMAT formula: AMAT = Hit-time + Miss-rate × Miss-penalty. To improve cache performance, we can improve any of the three components. Let's start by reducing miss rate.
35 The 4 C's of Cache Misses Compulsory: never accessed before. Capacity: accessed long ago and already replaced because the cache is too small. Conflict: neither compulsory nor capacity; caused by limited associativity. Coherence: (will discuss in the multi-processor lectures).
36 Cache size Cache size is data capacity (don't count tag and state). Bigger caches can exploit temporal locality better, but bigger is not always better. Too large a cache: smaller is faster, bigger is slower, and the access time may hurt the critical path. Too small a cache: limited temporal locality, and useful data is constantly replaced. (Plot: hit rate vs. cache size; the knee sits near the working-set size.)
37 Block size Block size is the data associated with an address tag; it is not necessarily the unit of transfer between hierarchy levels. Too small a block doesn't exploit spatial locality well and has excessive tag overhead. Too large a block transfers useless data, and with too few total blocks, useful data is frequently replaced. Common block sizes are 32-128 bytes. (Plot: hit rate vs. block size.)
38 Cache Conflicts What if two blocks alias onto the same frame? Same index, but different tags. Address sequence: 0xDEADBEEF, 0xFEEDBEEF, 0xDEADBEEF (tag | index | block offset). The second access to 0xDEADBEEF experiences a Conflict miss: not Compulsory (we've seen it before), not Capacity (lots of other frames are available in the cache).
41 Associativity In a cache with 8 frames, where does block 12 (0b1100) go? Direct-mapped: the block goes in exactly one frame (1 frame per set). Set-associative: the block goes in any frame of one set (frames grouped into sets). Fully-associative: the block goes in any frame (all frames in 1 set).
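The three placement policies can be sketched with one function. An assumption is made for illustration: set s occupies consecutive frames s·ways .. s·ways+ways−1, which is one common way to draw the picture.

```python
def candidate_frames(block, num_frames, ways):
    """Frames in which a block may live, for a cache organized as
    (num_frames / ways) sets of `ways` frames each."""
    num_sets = num_frames // ways
    s = block % num_sets                      # set index from the block number
    return [s * ways + w for w in range(ways)]

# Block 12 (0b1100) in an 8-frame cache:
print(candidate_frames(12, 8, 1))  # direct-mapped: exactly one frame
print(candidate_frames(12, 8, 2))  # 2-way set-associative: one set of 2 frames
print(candidate_frames(12, 8, 8))  # fully-associative: any of the 8 frames
```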
45 Associativity Larger associativity (for the same size): lower miss rate (fewer conflicts), but higher power consumption. Smaller associativity: lower cost and faster hit time. 2:1 rule of thumb: for small caches (up to 128KB), 2-way associativity gives the same miss rate as a direct-mapped cache of twice the size (holding cache size and block size constant). (Plot: hit rate vs. L1-D associativity.)
46 N-Way Set-Associative Cache Address split: tag[50:16], index[15:6], block offset[5:0]. The index selects a set; each way of the set has its own (state, tag, data) entries, tag comparator (=), and multiplexor, and a final multiplexor selects the data from the hitting way. Note the additional bit(s) moved from index to tag as associativity grows.
49 Fully-Associative Cache Address split: tag[50:6], block offset[5:0]. Keep blocks in cache frames, each with data, state (e.g., valid), and an address tag. The incoming tag is compared (=) against every frame's tag in parallel; this structure is a Content Addressable Memory (CAM). A multiplexor then selects the hitting frame's data.
50 Cache block replacement algorithms Which block in a set should be replaced on a miss? Ideal replacement (Belady's Algorithm): replace the block accessed farthest in the future. How would you implement it? Least Recently Used (LRU): optimized for temporal locality (expensive for > 2-way associativity. Why?). Not Most Recently Used (NMRU): track the MRU block and select randomly among the rest; same as LRU for 2-way. Random: nearly as good as LRU, sometimes better (when?). Pseudo-LRU: used in caches with high associativity. Examples: Tree-PLRU, Bit-PLRU.
51 Tree-based Pseudo-LRU The idea is to ensure you do not replace a recently accessed block. Not guaranteed to replace the *least* recently used block; it is a best effort to replace the *least* recently used block. For 2-way set-associative, keep 1 bit: the bit indicates which of the two lines has been referenced more recently, and we replace the one that was not recently accessed. For 4-way set-associative, we need 3 bits.
52 Tree-based pseudo-LRU example Access stream A B C D E, 4-way set-associative. Three recency bits form a binary tree over the four ways (one root bit plus one bit per pair of ways); each access flips the bits on its path to point away from the accessed way. A, B, C, and D fill the four ways in turn, updating the recency vector after each access; when E arrives, following the recency bits selects A's way, so E replaces A.
59 Tree-based Pseudo-LRU The set now holds E, B, D, C. Which is the next victim? C. Is it true LRU? What is the advantage of pseudo-LRU? A few bit flips encode recency information on each access, and a bit-vector lookup determines the victim. The victim is not always the LRU block, but it is never the MRU block, and more often than not it is close to LRU.
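The slide's example can be run as code. One convention has to be assumed, since the slides don't fix one: a node bit of 0 means "the victim is in the left half", and each access flips the bits on its path to point away from the accessed way. Invalid ways are filled in order before pseudo-LRU kicks in.

```python
class TreePLRU4:
    """Tree pseudo-LRU for one 4-way set: 3 bits arranged as a binary tree."""
    def __init__(self):
        self.bits = [0, 0, 0]        # [root, left pair, right pair]
        self.ways = [None] * 4

    def _touch(self, way):
        # Point every node on the accessed path away from `way`.
        if way < 2:
            self.bits[0] = 1         # victim search now goes right
            self.bits[1] = 1 - way   # within the left pair, away from `way`
        else:
            self.bits[0] = 0
            self.bits[2] = 3 - way

    def victim_way(self):
        # Follow the bits down the tree.
        return self.bits[1] if self.bits[0] == 0 else 2 + self.bits[2]

    def access(self, block):
        """Touch on hit; fill or replace on miss. Returns the evicted block."""
        if block in self.ways:
            self._touch(self.ways.index(block))
            return None
        way = self.ways.index(None) if None in self.ways else self.victim_way()
        evicted = self.ways[way]
        self.ways[way] = block
        self._touch(way)
        return evicted

set_ = TreePLRU4()
evictions = [set_.access(b) for b in "ABCDE"]
print(evictions)                      # E evicts A, as on the slides
print(set_.ways[set_.victim_way()])   # next victim: C
```

Under this convention the run reproduces the slides: E replaces A, and the next victim is C even though B is the true LRU block, illustrating that PLRU is only an approximation.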
62 Cache design issues
63 1 Non-blocking caches Should later cache lookups stall if there is a miss? An earlier miss can induce a head-of-the-line stall. Observation 1: misses usually happen in bursts; it is helpful to overlap the latencies of multiple parallel misses. Observation 2: a hit can happen while a miss is pending (particularly true for OOO execution). Observation 3: the main memory system can support a large number of in-flight requests. Idea: let's make caches non-blocking, i.e., the cache keeps accepting new requests while waiting for misses to be handled (how many new requests?). More parallelism in the cache subsystem is important for superscalar and OOO cores.
67 Implementing Non-blocking Caches On a miss: send the request to the next level of cache/memory, and put the miss information in a Miss Status Holding Register (MSHR): instruction tag (ROB#), address, load-or-store, store value, ... (Why?) Tag the miss request to the later cache/memory with the MSHR entry ID. When the memory response arrives: find the corresponding MSHR entry using the MSHR tag; merge the memory response data with the store value (if a store miss) and write it to the cache; broadcast the result on the CDB (if a load miss).
70 Implementing Non-blocking Caches If a new load/store request arrives for an already-missing line, merge the new miss into the existing MSHR instead of sending another request to the next cache level/main memory; MSHR entries should be big enough to keep info for multiple pending misses to the same line. How many MSHRs? Typically 10s; it depends on the targeted bandwidth and the average latency of each miss (remember Little's law from the first lecture?). E.g., 11 at the L1 level in current Intel Xeon (server) processors.
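Little's law gives a quick sizing estimate for the MSHR count. This is a sketch with illustrative numbers, not a derivation of any particular processor's design: in-flight requests = arrival rate × residence time.

```python
import math

def mshrs_needed(misses, per_cycles, avg_miss_latency):
    """Little's law: concurrent misses = miss rate * average miss latency.
    `misses` per `per_cycles` cycles; round up to whole MSHRs."""
    return math.ceil(misses * avg_miss_latency / per_cycles)

# To sustain one outstanding miss every 10 cycles with a 100-cycle
# average miss latency, about 10 misses are in flight at once:
print(mshrs_needed(1, 10, 100))  # 10
```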
74 2 Parallel vs. Serial Caches Tag and data are usually separate SRAMs; the tag array is smaller & faster. State bits (valid bit, LRU bit(s), ...) are stored along with the tags. Parallel access to tag and data reduces latency (good for L1). Serial access (compare tags first, then enable only the hitting data way) reduces power (good for L2+).
75 3 Victim caches Associativity is expensive: performance overhead from extra muxes, and power overhead from reading and checking more tags and data. Conflicts are also expensive: performance is lost to extra misses. Observation: conflicts don't occur in all sets. Idea: use a small fully-associative victim cache to absorb blocks displaced from the main cache, extending the associativity of sets experiencing many conflict misses.
76 Victim cache Access sequence: E A B N M C K L D J. In a 4-way set-associative L1 alone, every access is a miss: the five blocks A-E map to one set and the five blocks J-N to another, and neither group fits in a 4-way set. Adding a small fully-associative victim cache provides a fifth way, so long as only four sets overflow into it at the same time (it can even provide a 6th or 7th way). It provides extra associativity, but not for all sets.
78 4 Cache inclusivity The core often accesses blocks not present in any cache. Should the block be allocated in L3, L2, and L1? Inclusive caches: yes. This wastes space and requires forced evictions (e.g., force an evict from L1 on an evict from L2+). Non-inclusive caches: only allocate the block in L1 (why not call them exclusive?). Some processors combine both: L3 is inclusive of L1 and L2 (why do inclusive caches?), while L2 is non-inclusive of L1 (like a large victim cache).
79 5 Write propagation When should a new value be propagated to (lower-level) memory? Option #1, write-through: immediately. On a hit, update the cache and immediately send the write to the next level. Option #2, write-back: when the block is replaced/evicted. Requires an additional dirty bit per block: replacing a clean block causes no extra traffic; replacing a dirty block requires an extra writeback of the block. What are the trade-offs?
80 Write propagation comparison Write-through: requires additional bus bandwidth (consider repeated write hits), and the next level must handle small writes (1, 2, 4, 8 bytes). + No need for dirty bits in the cache. + No need to handle writeback operations, which simplifies miss handling (no WBB). Sometimes used for L1 caches (for example, by IBM). Write-back: + key advantage: uses less bandwidth (the reverse of the pros/cons above). Used by Intel and AMD. Second-level caches and beyond are generally write-back caches.
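The "repeated write hits" point can be made concrete with a toy traffic model. This is a sketch under stated assumptions (8-byte stores, a 64-byte block, all writes hitting the same cached block), not a model of any real bus.

```python
def write_through_traffic(writes, word_bytes=8):
    """Write-through: every write hit is also sent to the next level."""
    return writes * word_bytes

def write_back_traffic(writes, block_bytes=64):
    """Write-back: repeated hits just dirty the block; one block writeback
    happens when the block is eventually evicted."""
    return block_bytes if writes > 0 else 0

# 1000 8-byte write hits to the same cached block:
print(write_through_traffic(1000))  # 8000 bytes sent to the next level
print(write_back_traffic(1000))     # 64 bytes (a single writeback)
```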
81 6 Write/store miss handling Should we bring the data into the cache on a write miss? (For the L1 cache, store == write.)
82 Write miss handling How is a write miss actually handled? Write-allocate: fill the block from the next level, then write it. + Decreases read misses (the next read to the block will hit). Requires additional bandwidth. Commonly used (especially with write-back caches). Write-non-allocate: just write to the next level, with no allocation. Potentially more read misses. + Uses less bandwidth. Used with write-through.
83 Write miss latency On a read miss, the load can't go on without the data; it must stall. On a write miss, what happens when the store instruction reaches the head of the ROB and misses in the cache? It stalls retirement of later instructions. What can we do about it? Remember that stores can be the end of a dependence chain of instructions: is the store latency even on the critical path?
85 Write Misses and Write Buffers On a read miss, the load can't go on without the data; it must stall. On a write miss, technically, if no instruction is waiting for the data, why stall? Write buffer (a.k.a. store buffer): a small buffer. How does it help? Write buffer vs. writeback buffer: the write buffer (WB) sits in front of the L1 D$, for hiding store misses; the writeback buffer (WBB) sits behind the L1 D$, for hiding writebacks. (Processor → WB → Cache → WBB → next-level cache.) Any possible issues? Forwarding again??
87 7 Local vs. Global Miss Rates Local hit/miss rate: the percentage of references to this cache that hit (e.g., 90%); the local miss rate is 100% minus the local hit rate (e.g., 10%). Global hit/miss rate: misses per instruction (e.g., 1 miss per 30 instructions) or instructions per miss (e.g., 3% of instructions miss, assuming loads/stores are 1 in 3 instructions). Consider the second-level cache hit rate: L1: 2 misses per 100 instructions; L2: 1 miss per 100 instructions. L2 local miss rate → 50%.
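The local-vs-global arithmetic can be checked directly. A minimal sketch: since L2 is only accessed on L1 misses, its local miss rate is L2 misses divided by L1 misses.

```python
l1_misses_per_1000_inst = 20   # "2 misses per 100 instructions"
l2_misses_per_1000_inst = 10   # "1 miss per 100 instructions"

# L2 accesses = L1 misses, so:
# local L2 miss rate = (L2 misses) / (L1 misses)
l2_local_miss_rate = l2_misses_per_1000_inst / l1_misses_per_1000_inst
print(l2_local_miss_rate)  # 0.5, i.e. the 50% on the slide
```

Note how a modest-looking global rate (1 miss per 100 instructions) can correspond to a poor local rate (50%), which is why both metrics matter.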
89 Techniques We've Seen So Far Use caching to reduce memory latency. Use wide out-of-order execution to hide memory latency by overlapping misses with other useful work; but we cannot efficiently go much wider than several instructions. Neither is enough for server applications: not much locality (mostly accessing linked data structures), not much ILP and MLP, and server apps spend 50-66% of their time stalled on memory. We need a different strategy.
91 Prefetching
92 Prefetching (1) Fetch data ahead of demand. Why could it be useful? Can any caching technique avoid compulsory misses? Caching does not tell us what could be needed next; a replacement policy only predicts what in the cache may no longer be required. Main challenges: knowing what to fetch (fetching useless blocks wastes resources), and knowing when to fetch (too early pollutes the cache, or the block gets thrown out before use; too late defeats the purpose of *pre*-fetching). Prefetching must be accurate and timely.
96 Prefetching (2) Without prefetching: a load must walk down L1 → L2 → DRAM before the data comes back, so the total load-to-use latency is the full miss latency. With prefetching: a prefetch issued well before the load brings the block in early, the later load hits, and the load-to-use latency is much improved. Or, if the prefetch is issued late, the load overlaps with the in-flight prefetch and the latency is only somewhat improved. Prefetching must be accurate and timely.
114 Types of Prefetching Software: by the compiler, by the programmer. Hardware: Next-Line/Adjacent-Line, Next-N-Line, Stream Buffers, Stride, Localized (PC-based), Pointer, Correlation.
117 Software Prefetching (1) Prefetch data using explicit instructions, inserted by the compiler and/or programmer. Put the prefetched value into: a register (binding prefetch), also called hoisting (the same as reordering instructions; basically just moving the load instruction up in the program); or the cache (non-binding prefetch), which requires ISA support, and the block may get evicted from the cache before the demand access.
121 Software Prefetching (2) Consider basic block B containing R1 = [R2], followed by R3 = R1 + 4 in block C. Hoisting is prone to many problems: we must be aware of dependences (hoisting R1 = [R2] above block A's R1 = R1 - 1 violates the WAW dependence on R1); the hoisted load must not cause exceptions that were not possible in the original execution; and it increases register pressure for the compiler. Using a prefetch instruction instead (PREFETCH [R2] placed in block A) avoids all these problems.
125 Software Prefetching (3)

for (i = 1; i < rows; i++) {
  for (j = 1; j < columns; j++) {
    prefetch(&x[i+1][j]);
    sum = sum + x[i][j];
  }
}

The prefetch instruction reads the containing block from memory and puts it in the cache.
126 Software Prefetching (4) Pros: gives the programmer control and flexibility; allows for complex (compiler) analysis; no (major) hardware modifications needed. Cons: prefetch instructions increase the code footprint, which may cause more I$ misses and code-alignment issues; it is hard to perform timely prefetches (at IPC=2 and a 100-cycle memory latency, the load would have to move 200 instructions earlier, and there might not even be 200 instructions in the current function); and prefetching earlier and more often leads to low accuracy, since the program may go down a different path (block B in the previous slides).
128 Hardware Prefetching Hardware monitors memory accesses, looks for common patterns, and makes predictions. Predicted addresses are placed into a prefetch queue, which is checked when no demand accesses are waiting; prefetches look like load requests to the memory hierarchy. Prefetchers trade bandwidth for latency: extra bandwidth is used only when guessing incorrectly, and latency is reduced only when guessing correctly. No need to change software.
131 Hardware Prefetcher Design Space What to prefetch? Predict regular patterns (x, x+8, x+16, ...) or correlated patterns (A..B→C, B..C→J, A..C→K, ...). When to prefetch? On every reference (lots of lookup/prefetch overhead); on every miss (patterns filtered by the caches); on hits to prefetched data (positive feedback). Where to put prefetched data? Prefetch buffers, or the caches themselves.
134 Prefetching at Different Levels (Figure: Intel Core2 prefetcher locations in the hierarchy: registers, I-TLB / L1 I-Cache, D-TLB / L1 D-Cache, L2, L3/LLC.) Real CPUs have multiple prefetchers with different strategies, usually closer to the core (it is easier to detect patterns there). Prefetching at the LLC is hard. Why? Access patterns from many cores are mixed in a shared LLC, and the LLC is typically heavily banked (why is this a problem?).
136 Next-Line (or Adjacent-Line) Prefetching On a request for line X, prefetch X+1. Assumes spatial locality (why not just increase the block size?). Should stop at physical (OS) page boundaries (why?). Works for both I$ and D$: instructions execute sequentially, and large data structures often span multiple blocks. Simple, but usually not timely.
139 Next-N-Line Prefetching On request for line X, prefetch X+1, X+2, ..., X+N N is called prefetch depth or prefetch degree
140 Next-N-Line Prefetching On request for line X, prefetch X+1, X+2, ..., X+N N is called prefetch depth or prefetch degree Must carefully tune depth N. Large N: more likely to be timely (always?) More aggressive -> more likely to make a mistake Might evict something useful More expensive -> need storage for prefetched lines Might delay useful requests on interconnect or port
141 Next-N-Line Prefetching On request for line X, prefetch X+1, X+2, ..., X+N N is called prefetch depth or prefetch degree Must carefully tune depth N. Large N: more likely to be timely (always?) More aggressive -> more likely to make a mistake Might evict something useful More expensive -> need storage for prefetched lines Might delay useful requests on interconnect or port Still simple, but more timely than Next-Line
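The next-N-line policy above (with its page-boundary stop) can be sketched in a few lines; the 64-byte line and 4 KiB page sizes are the common defaults, and all names are illustrative:

```python
# Hypothetical sketch of next-N-line prefetch address generation.
LINE_SIZE = 64
PAGE_SIZE = 4096

def next_n_line_prefetches(miss_addr, depth):
    """Return addresses of the next `depth` lines after the missing one,
    clipped so we never cross into the next physical page (whose mapping
    we cannot know without a TLB lookup)."""
    line_base = miss_addr & ~(LINE_SIZE - 1)
    page_end = (miss_addr & ~(PAGE_SIZE - 1)) + PAGE_SIZE
    candidates = []
    for i in range(1, depth + 1):
        addr = line_base + i * LINE_SIZE
        if addr >= page_end:   # stop at the physical page boundary
            break
        candidates.append(addr)
    return candidates
```

Note how a miss near the end of a page yields fewer than N candidates, which is exactly the "should stop at page boundaries" point on the slide.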
142 Stride Prefetching (1) Column in matrix Elements in array of structs Access patterns often follow a stride Example 1: Accessing column of elements in a matrix Example 2: Accessing elements in array of structs
143 Stride Prefetching (1) Column in matrix Elements in array of structs Access patterns often follow a stride Example 1: Accessing column of elements in a matrix Example 2: Accessing elements in array of structs Detect stride S, prefetch depth N Prefetch X+S, X+2S, ..., X+NS
144 Stride Prefetching (2) Must carefully select depth N Same constraints as Next-N-Line prefetcher How to tell the difference between A[i] -> A[i+1] (a stride) and X -> Y (unrelated accesses)? Wait until you see the same stride a few times Can vary prefetch depth based on confidence More consecutive strided accesses -> higher confidence [diagram: new access to A+3S matches table entry (Last Addr A+2S, Stride S, Count 2); count > 2, so prefetch A+4S and update the entry]
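The confidence scheme just described can be sketched as a single table entry; the threshold and depth values are illustrative, not taken from any real design:

```python
# Toy stride-detector entry: track last address, stride, and a
# confidence count; prefetch only once the same stride repeats.
class StrideEntry:
    def __init__(self, threshold=2, depth=1):
        self.last_addr = None
        self.stride = None
        self.count = 0
        self.threshold = threshold
        self.depth = depth

    def access(self, addr):
        """Update on a new access; return prefetch addresses (possibly empty)."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.stride:
                self.count += 1      # same stride again: grow confidence
            else:
                self.stride = stride # new stride: restart training
                self.count = 1
            if self.count > self.threshold and self.stride != 0:
                prefetches = [addr + i * self.stride
                              for i in range(1, self.depth + 1)]
        self.last_addr = addr
        return prefetches
```

With threshold 2, the entry stays silent during training and starts issuing prefetches only after the stride has repeated enough times.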
145 Localized Stride Prefetchers (1) What if multiple strides are interleaved? Y = A + X? Load R1 = [R2] Load R3 = [R4] Add R5, R1, R3 Store [R6] = R5 A, X, Y, A+S0, X+S1, Y+S2, A+2S0, X+2S1, Y+2S2,
146 Localized Stride Prefetchers (1) What if multiple strides are interleaved? No clearly-discernible stride Y = A + X? Load R1 = [R2] Load R3 = [R4] Add R5, R1, R3 Store [R6] = R5 A, X, Y, A+S0, X+S1, Y+S2, A+2S0, X+2S1, Y+2S2, (observed global strides (X-A), (Y-X), (A+S-Y) repeat with no single stride)
147 Localized Stride Prefetchers (1) What if multiple strides are interleaved? No clearly-discernible stride Y = A + X? Load R1 = [R2] Load R3 = [R4] Add R5, R1, R3 Store [R6] = R5 A, X, Y, A+S0, X+S1, Y+S2, A+2S0, X+2S1, Y+2S2, (observed global strides (X-A), (Y-X), (A+S-Y) repeat with no single stride) Observation: Accesses to structures usually correlate with an instruction PC
148 Localized Stride Prefetchers (1) What if multiple strides are interleaved? No clearly-discernible stride Y = A + X? Load R1 = [R2] Load R3 = [R4] Add R5, R1, R3 Store [R6] = R5 A, X, Y, A+S0, X+S1, Y+S2, A+2S0, X+2S1, Y+2S2, (observed global strides (X-A), (Y-X), (A+S-Y) repeat with no single stride) Observation: Accesses to structures usually correlate with an instruction PC Idea: Use an array of strides, indexed by PC
149 Localized Stride Prefetchers (2) Store PC, last address, last stride, and count in a Reference Prediction Table (RPT) On access, check RPT Same stride? count++ if yes, count-- or count=0 if no If count is high, prefetch (last address + stride) [diagram: loads at PCs 0x409A34, 0x409A38, 0x409A40 index RPT entries (Tag 0x409..., Last Addr A+3S0, Stride S0, Count 2), (X+3S1, S1, 2), (Y+2S2, S2, 1); if confident about the stride (count > C_min), prefetch A+4S0]
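A minimal model of the RPT described above, keyed by load PC; the dictionary stands in for the tagged hardware table, and the confidence threshold is illustrative:

```python
# Sketch of a PC-indexed Reference Prediction Table.
def make_rpt():
    return {}

def rpt_access(rpt, pc, addr, threshold=1):
    """Update the entry for this PC; return a prefetch address or None."""
    entry = rpt.get(pc)
    if entry is None:
        rpt[pc] = {"last": addr, "stride": 0, "count": 0}
        return None
    stride = addr - entry["last"]
    if stride == entry["stride"]:
        entry["count"] += 1          # same stride: grow confidence
    else:
        entry["stride"] = stride     # stride changed: reset confidence
        entry["count"] = 0
    entry["last"] = addr
    if entry["count"] > threshold and stride != 0:
        return addr + stride
    return None
```

Because each load PC gets its own entry, interleaved streams from different instructions no longer corrupt each other's stride training.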
150 Other Prefetch Patterns Sometimes accesses are highly predictable, but no strides Linked data structures (e.g., lists or trees) A B C D E F Linked-list traversal
151 Other Prefetch Patterns Sometimes accesses are highly predictable, but no strides Linked data structures (e.g., lists or trees) A B C D E F Linked-list traversal [diagram: actual memory layout D A F B C E (no chance to detect a stride)]
152 Pointer Prefetching (1) Data filled on cache miss (512 bits of data) Which ones are pointers? Nope Nope Nope Nope Maybe! Maybe! Nope Nope Pointers usually look different
153 Pointer Prefetching (1) Data filled on cache miss (512 bits of data) Which ones are pointers? Nope Nope Nope Nope Maybe! Maybe! Nope Nope Go ahead and prefetch these (needs some help from the TLB) Pointers usually look different
154 Pointer Prefetching (1) Data filled on cache miss (512 bits of data) Which ones are pointers? Nope Nope Nope Nope Maybe! Maybe! Nope Nope struct bintree_node_t { int data1; int data2; struct bintree_node_t * left; struct bintree_node_t * right; }; Go ahead and prefetch these (needs some help from the TLB) This allows you to walk the tree (or other pointer-based data structures which are typically hard to prefetch) Pointers usually look different
155 Pointer Prefetching (2) Relatively cheap to implement Don't need extra hardware to store patterns But can fetch a lot of junk Limited lookahead makes timely prefetches hard Can't get next pointer until fetched data block What is the prefetch depth? [timeline: a stride prefetcher can issue X, X+S, X+2S with overlapping access latencies; a pointer prefetcher must wait for each of A, B, C to return before issuing the next]
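The "looks like a pointer" test from the previous slides can be sketched as a scan over a freshly filled line. The heap bounds here are purely illustrative assumptions; a real design would need TLB help to decide which values are plausible addresses:

```python
# Hedged sketch of content-directed pointer detection on a 64-byte line.
import struct

HEAP_LO, HEAP_HI = 0x1000_0000, 0x2000_0000   # assumed heap address range

def pointer_candidates(line_bytes):
    """Return 8-byte word values in the line that fall inside the
    assumed heap range, i.e. the 'Maybe!' candidates to prefetch."""
    assert len(line_bytes) == 64
    words = struct.unpack("<8Q", line_bytes)
    return [w for w in words if HEAP_LO <= w < HEAP_HI]
```

Small integers in the line (the `data1`/`data2` fields of the tree node) fall outside the range and are rejected, while the `left`/`right` fields survive the filter.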
156 Pair-wise Temporal Correlation (1) Accesses exhibit temporal correlation If E followed D in the past, then if we see D, prefetch E Somewhat similar to history-based branch prediction [diagram: a correlation table built from a linked-list traversal (A B C D E F) over the actual memory layout] Can use recursively to get more lookahead
157 Pair-wise Temporal Correlation (2) Many patterns are more complex than linked lists Can be represented by a Markov Model Requires tracking multiple potential successors Number of candidates is called breadth [diagram: Markov model with transition probabilities and the corresponding correlation table] Recursive lookahead: breadth & depth grow exponentially
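A small model of such a correlation table with bounded breadth; the structure and names are illustrative. On each access it records the previous-to-current transition and predicts the most frequent successors of the current address:

```python
# Sketch of a pair-wise temporal correlation table.
from collections import defaultdict, Counter

class CorrelationTable:
    def __init__(self, breadth=2):
        self.succ = defaultdict(Counter)   # addr -> counts of successors
        self.breadth = breadth             # max candidates to predict
        self.prev = None

    def access(self, addr):
        """Record the (prev -> addr) transition, then return up to
        `breadth` predicted successors of `addr`, by frequency."""
        if self.prev is not None:
            self.succ[self.prev][addr] += 1
        self.prev = addr
        return [a for a, _ in self.succ[addr].most_common(self.breadth)]
```

After one pass over a linked-list traversal, revisiting the list head already predicts the next node even though the nodes sit at unrelated addresses.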
158 Increasing Correlation History Length Like branch prediction, longer history can provide more accuracy (i.e., looking for a sequence of accesses) but increases training time Use a history hash for lookup E.g., XOR the bits of the addresses of the last K accesses Tree traversal example: A B D B E B A C F C G C A [diagram: tree and its traversal order] Better accuracy, larger storage cost
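The XOR history hash can be sketched in a few lines; K and the addresses are illustrative:

```python
# Tiny sketch of the history-hash lookup: fold the last K access
# addresses into a single correlation-table key.
from collections import deque

K = 3

def history_key(history):
    """XOR a sequence of addresses into one lookup key."""
    key = 0
    for a in history:
        key ^= a
    return key

window = deque(maxlen=K)          # sliding window of recent accesses
for addr in (0x100, 0x140, 0x180):
    window.append(addr)
key = history_key(window)         # 0x100 ^ 0x140 ^ 0x180 == 0x1C0
```

One caveat of a plain XOR: it is order-insensitive, so a real design might rotate or shift each address by its position before XORing so that different access orders hash to different keys.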
159 Stream Buffers (1) Used to avoid cache pollution caused by deep prefetching Each FIFO buffer holds one stream of sequentially prefetched lines Keep next-N available in buffer On a load miss, check the head of all buffers if match, pop the entry from the FIFO, fetch the N+1st line into the buffer if miss, allocate a new stream buffer (use LRU for recycling) [diagram: several FIFO stream buffers sit between the cache and the memory interface]
160 Stream Buffers (2) Can incorporate stride prediction mechanisms to support non-unit-stride streams Can extend to quasi-sequential stream buffer On request Y in [X, X+N], advance by Y-X+1 Allows buffer to work when items are skipped Requires expensive (associative) comparison
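A toy model of the stream-buffer lookup described above. For simplicity it recycles the oldest-allocated buffer rather than true LRU, and the sizes are illustrative:

```python
# Simplified stream-buffer model: on a load miss, check the head of
# each FIFO; on a head hit, pop it and refill the tail; otherwise
# allocate a new stream of the next `depth` lines.
from collections import deque

LINE = 64

class StreamBuffers:
    def __init__(self, n_buffers=4, depth=4):
        self.depth = depth
        # deque of per-stream FIFOs; oldest stream is evicted on overflow
        self.buffers = deque(maxlen=n_buffers)

    def lookup(self, miss_addr):
        """Return True on a stream-buffer head hit, else allocate a stream."""
        line = miss_addr & ~(LINE - 1)
        for buf in self.buffers:
            if buf and buf[0] == line:
                buf.popleft()                 # supply this line to the cache
                # refill the tail to keep next-N lines available
                buf.append(buf[-1] + LINE if buf
                           else line + self.depth * LINE)
                return True
        # missed in all buffers: start a new sequential stream
        self.buffers.append(deque(line + i * LINE
                                  for i in range(1, self.depth + 1)))
        return False
```

Because prefetched lines sit in the buffers rather than the cache, a wrong stream pollutes only one small FIFO, which is the pollution-avoidance argument on the slide.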
161 Example: Prefetchers in Intel Sandy Bridge Data L1 1) PC-localized stride prefetcher 2) Next-line prefetcher Only on an ascending access to very recently loaded data L2 3) Stream buffer/prefetcher: Prefetch the cache line that pairs with the current one to make a 128-byte aligned chunk 4) Stream prefetcher: detects streams of requests made by L1 (I and D) and prefetches lines down the stream # of lines to prefetch depends on # of outstanding requests from L1 Far lines are only prefetched to L3; closer ones are brought to L2 See the Intel Architectures Optimization Reference Manual for more details
162 Metrics for prefetching Coverage = (# of $ misses eliminated by prefetching) / (# of total $ misses) # of total $ misses = # of misses eliminated by prefetching + # of misses not eliminated by prefetching Accuracy = (# of misses eliminated by prefetching) / (# of useless prefetches + # of misses eliminated by prefetching) Timeliness, a qualitative metric
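The two quantitative metrics compute directly from miss and prefetch counts; the example numbers below are made up for illustration:

```python
# Coverage and accuracy, exactly as defined above.
def coverage(eliminated, not_eliminated):
    """Fraction of the original misses that prefetching removed."""
    return eliminated / (eliminated + not_eliminated)

def accuracy(eliminated, useless_prefetches):
    """Fraction of issued prefetches that were actually useful."""
    return eliminated / (eliminated + useless_prefetches)

# e.g. 60 misses eliminated, 40 misses remaining, 20 useless prefetches:
# coverage = 60/100 = 0.6, accuracy = 60/80 = 0.75
```

Note the tension the metrics expose: a more aggressive prefetcher usually raises coverage while lowering accuracy, and neither captures timeliness.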
More informationFall 2011 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic
Fall 2011 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic Reading: Data prefetch mechanisms, Steven P. Vanderwiel, David J. Lilja, ACM Computing Surveys, Vol. 32, Issue 2 (June 2000) If memory
More informationAgenda. EE 260: Introduction to Digital Design Memory. Naive Register File. Agenda. Memory Arrays: SRAM. Memory Arrays: Register File
EE 260: Introduction to Digital Design Technology Yao Zheng Department of Electrical Engineering University of Hawaiʻi at Mānoa 2 Technology Naive Register File Write Read clk Decoder Read Write 3 4 Arrays:
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More information5 Solutions. Solution a. no solution provided. b. no solution provided
5 Solutions Solution 5.1 5.1.1 5.1.2 5.1.3 5.1.4 5.1.5 5.1.6 S2 Chapter 5 Solutions Solution 5.2 5.2.1 4 5.2.2 a. I, J b. B[I][0] 5.2.3 a. A[I][J] b. A[J][I] 5.2.4 a. 3596 = 8 800/4 2 8 8/4 + 8000/4 b.
More informationMemory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology
Memory Hierarchies Instructor: Dmitri A. Gusev Fall 2007 CS 502: Computers and Communications Technology Lecture 10, October 8, 2007 Memories SRAM: value is stored on a pair of inverting gates very fast
More informationMemory Hierarchy. 2/18/2016 CS 152 Sec6on 5 Colin Schmidt
Memory Hierarchy 2/18/2016 CS 152 Sec6on 5 Colin Schmidt Agenda Review Memory Hierarchy Lab 2 Ques6ons Return Quiz 1 Latencies Comparison Numbers L1 Cache 0.5 ns L2 Cache 7 ns 14x L1 cache Main Memory
More informationCaching Basics. Memory Hierarchies
Caching Basics CS448 1 Memory Hierarchies Takes advantage of locality of reference principle Most programs do not access all code and data uniformly, but repeat for certain data choices spatial nearby
More informationEITF20: Computer Architecture Part4.1.1: Cache - 2
EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More information