Caches and Prefetching


1 Caches and Prefetching

2 Acknowledgements Several of the slides in this deck are from Luis Ceze (Washington), Nima Honarmand (Stony Brook), Mark Hill, David Wood, Karu Sankaralingam (Wisconsin), and Abhishek Bhattacharjee (Rutgers). Development of this course is partially supported by Western Digital Corporation.

3 Performance Motivation [chart: processor vs. memory performance over time, showing the widening gap] Want memory to appear: as fast as the CPU, and as large as required by all of the running applications.

4 Storage Hierarchy Make the common case fast: Common: temporal & spatial locality. Fast: smaller, more expensive memory. Hierarchy: Registers -> Caches (SRAM) -> Memory (DRAM) -> [SSD? (Flash)] -> Disk (Magnetic Media). Registers and caches are controlled by hardware; memory and below are controlled by software (the OS).

5-6 Storage Hierarchy Make the common case fast: Common: temporal & spatial locality. Fast: smaller, more expensive memory. Hierarchy: Registers -> Caches (SRAM) -> Memory (DRAM) -> [SSD? (Flash)] -> Disk (Magnetic Media). Moving down the hierarchy: larger, cheaper, bigger transfers, more bandwidth; moving up: faster. Registers and caches are controlled by hardware; memory and below are controlled by software (the OS).

7 Caches A cache is automatically managed by hardware. Break memory into blocks (typically 64 bytes) and transfer data to/from the cache in blocks, to exploit spatial locality. Keep recently accessed blocks, to exploit temporal locality. Both locality principles are typical program behavior. [diagram: Core - $ - Memory]

8 Cache Terminology block (cache line): minimum unit that may be cached frame: cache storage location to hold one block hit: block is found in the cache miss: block is not found in the cache miss ratio: fraction of references that miss hit time: time to access the cache miss penalty: time to retrieve block on a miss

9-16 Cache Example Address sequence from core (assume 8-byte lines): 0x10000, 0x10004, 0x10120, ... The access to 0x10000 misses, and memory returns the block containing 0x10000 (data). The access to 0x10004 hits (same 8-byte line as 0x10000). The access to 0x10120 misses, and memory returns the block containing 0x10120 (data). Final miss ratio is 50%.

17 Typical memory hierarchy L1 is usually split into separate I$ (instruction cache) and D$ (data cache); L2 and L3 are unified. Why multi-level caches? [diagram: Processor with Registers, I-TLB, L1 I-Cache, L1 D-Cache, D-TLB, then L2 Cache, L3 Cache (LLC), and Main Memory (DRAM)]

18 Average Memory Access Time AMAT = Hit-time + Miss-rate x Miss-penalty. If an L1 cache hit is 5 cycles (core to L1 and back), an L2 cache hit is 20 cycles (core to L2 and back), and a memory access is 100 cycles (L2 miss penalty), then at a 20% miss ratio in L1 and a 40% miss ratio in L2: avg. access = 5 + 0.2 x (0.6 x 20 + 0.4 x 100) = 15.4 cycles.
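A quick sketch of the arithmetic behind the 15.4-cycle figure, treating the L1 miss penalty as the average time spent at the L2 level (hit or miss); the numbers are the slide's example values:

    #include <stdio.h>

    /* Sketch: AMAT for a two-level hierarchy using the slide's numbers.
     * The L1 miss penalty is modeled as the average L2-level time:
     * an L2 hit costs 20 cycles, an L2 miss costs the full 100-cycle memory access. */
    int main(void) {
        double l1_hit = 5.0, l2_hit = 20.0, mem = 100.0;
        double l1_miss_rate = 0.20, l2_miss_rate = 0.40;

        double l2_level = (1.0 - l2_miss_rate) * l2_hit + l2_miss_rate * mem; /* 0.6*20 + 0.4*100 = 52 */
        double amat = l1_hit + l1_miss_rate * l2_level;
        printf("AMAT = %.1f cycles\n", amat);  /* prints 15.4 */
        return 0;
    }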

19 Typical memory hierarchy L1 and L2 are private per core; L3 is shared (why shared?). [diagram: two cores, each with Registers, I-TLB, L1 I-Cache, L1 D-Cache, D-TLB, and a private L2 Cache, all sharing the L3 Cache (LLC) and Main Memory (DRAM)] [Later] Multi-core replicates the top of the hierarchy.

20 Intel Nehalem (3.3 GHz, 4 cores, 2 threads per core) Memory hierarchy: per core, 32 KB L1-I, 32 KB L1-D, and 256 KB L2.

21 How to Build a Cache

22-26 SRAM Overview Chained inverters maintain a stable state (storing a 0 or a 1). Access gates provide access to the cell via the bitlines (b and b-bar). Writing to the cell involves over-powering the storage inverters. 6T SRAM cell: 2 access gates, 2 transistors per inverter.

27-32 8-bit and 8x8-bit SRAM Arrays [diagram: SRAM cells arranged in rows; a 3-to-8 (1-of-8) decoder selects one wordline, which connects that row of cells to the bitlines]

33 Direct-Mapped Cache using SRAM tag[50:16] index[15:6] block offset[5:0]. The index selects one frame via a decoder; the frame holds data, state, and a tag, so only one tag comparison is needed (tag match -> hit). A multiplexor selects the requested word from the block. Why take the index bits out of the middle of the address?
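A small sketch of splitting an address with this layout (6 offset bits, 10 index bits; the remaining upper bits form the tag):

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch: block offset = bits [5:0], index = bits [15:6], tag = the rest. */
    #define OFFSET_BITS 6
    #define INDEX_BITS  10

    int main(void) {
        uint64_t addr = 0x10120;
        uint64_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint64_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        printf("tag=0x%llx index=0x%llx offset=0x%llx\n",
               (unsigned long long)tag, (unsigned long long)index, (unsigned long long)offset);
        return 0;
    }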

34 Improving Cache Performance Recall the AMAT formula: AMAT = Hit-time + Miss-rate x Miss-penalty. To improve cache performance, we can improve any of the three components. Let's start by reducing the miss rate.

35 The 4 C's of Cache Misses Compulsory: never accessed before. Capacity: accessed long ago and already replaced because the cache is too small. Conflict: neither compulsory nor capacity; caused by limited associativity. Coherence: (will be discussed in the multi-processor lectures).

36 Cache size [plot: hit rate vs. cache size, rising until the capacity reaches the working-set size] Cache size is data capacity (don't count tag and state). Bigger can exploit temporal locality better, but bigger is not always better: too large a cache is slower (smaller is faster), and its access time may hurt the critical path; too small a cache captures limited temporal locality, and useful data is constantly replaced.

37 Block size [plot: hit rate vs. block size, peaking at intermediate sizes] Block size is the data associated with an address tag; it is not necessarily the unit of transfer between hierarchy levels. Too small a block: doesn't exploit spatial locality well, and has excessive tag overhead. Too large a block: useless data is transferred, and with too few total blocks useful data is frequently replaced. Common block sizes are 32-128 bytes.

38-40 Cache Conflicts What if two blocks alias to the same frame (same index, but different tags)? Address sequence: 0xDEADBEEF, 0xFEEDBEEF, 0xDEADBEEF. The second access to 0xDEADBEEF experiences a Conflict miss: not Compulsory (we have seen it before), not Capacity (lots of other frames are available in the cache).
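A quick check, reusing the tag/index/offset split from slide 33, that these two addresses share an index but carry different tags:

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch: with a 6-bit offset and 10-bit index (slide 33's layout),
     * 0xDEADBEEF and 0xFEEDBEEF map to the same set but have different tags,
     * so in a direct-mapped cache they evict each other (conflict misses). */
    static uint64_t index_of(uint64_t a) { return (a >> 6) & 0x3FF; }
    static uint64_t tag_of(uint64_t a)   { return a >> 16; }

    int main(void) {
        uint64_t a = 0xDEADBEEF, b = 0xFEEDBEEF;
        printf("index: 0x%llx vs 0x%llx\n",
               (unsigned long long)index_of(a), (unsigned long long)index_of(b));
        printf("tag:   0x%llx vs 0x%llx\n",
               (unsigned long long)tag_of(a), (unsigned long long)tag_of(b));
        return 0;
    }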

41-44 Associativity In a cache with 8 frames, where does block 12 (0b1100) go? Direct-mapped: the block goes in exactly one frame (1 frame per set). Set-associative: the block goes in any frame within one set (frames are grouped into sets). Fully-associative: the block goes in any frame (all frames form 1 set).
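A small sketch of where block 12 may be placed in an 8-frame cache under the three organizations (using the standard block-number mod number-of-sets rule):

    #include <stdio.h>

    /* Sketch: placement of block 12 in an 8-frame cache. With `ways` frames per
     * set there are 8/ways sets; the block may go in any frame of set
     * (block_number % num_sets). */
    static void where(int block, int frames, int ways) {
        int sets = frames / ways;
        int set = block % sets;
        printf("%d-way: set %d, frames %d..%d\n", ways, set, set * ways, set * ways + ways - 1);
    }

    int main(void) {
        int block = 12, frames = 8;
        where(block, frames, 1); /* direct-mapped: 8 sets -> set 4, frame 4 only */
        where(block, frames, 2); /* 2-way:         4 sets -> set 0, frames 0..1  */
        where(block, frames, 8); /* fully-assoc:   1 set  -> any of frames 0..7  */
        return 0;
    }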

45 Associativity [plot: hit rate vs. associativity, with diminishing returns] Larger associativity (for the same size): lower miss rate (fewer conflicts), higher power consumption. Smaller associativity: lower cost, faster hit time. 2:1 rule of thumb: for small caches (up to 128KB), 2-way associativity gives about the same miss rate as a direct-mapped cache of twice the size (holding cache and block size constant). Typical L1-D associativity is around 4-8 ways.

46-48 N-Way Set-Associative Cache tag[50:16] index[15:6] block offset[5:0]. The index selects one set via the decoders; within the set, each way holds data, state, and a tag. All ways are read in parallel, each way's tag is compared against the address tag (one comparator and multiplexor per way), and a final multiplexor selects the data of the hitting way. Note the additional bit(s) moved from index to tag compared to the direct-mapped cache.

49 Fully-Associative Cache tag[50:6] block offset[5:0]. Keep blocks in cache frames, each holding data, state (e.g., valid), and an address tag. Every frame's tag is compared against the address tag in parallel (a Content Addressable Memory, CAM); a multiplexor then selects the hitting frame's data.

50 Cache block replacement algorithms Which block in a set to replace on a miss? Ideal replacement (Belady's algorithm): replace the block accessed farthest in the future. How do you implement it? Least Recently Used (LRU): optimized for temporal locality (expensive for > 2-way associativity. Why?). Not Most Recently Used (NMRU): track the MRU block and select randomly among the rest; same as LRU for 2-way. Random: nearly as good as LRU, sometimes better (when?). Pseudo-LRU: used in caches with high associativity; examples: Tree-PLRU, Bit-PLRU.

51 Tree-based Pseudo-LRU Idea: ensure you do not replace a recently accessed block. Not guaranteed to replace the *least* recently used block; a best effort to approximate it. For 2-way set-associative, keep 1 bit: the bit indicates which of the two lines has been referenced more recently, and the other line is replaced. For 4-way set-associative, you need 3 bits.

52-58 Tree-based Pseudo-LRU Example Access stream A, B, C, D, E on one 4-way set-associative set. The 3-bit recency vector starts at 0 0 0 and is updated as A, B, C, and D are installed; the access to E then follows the vector to pick a victim. [tree diagrams showing the recency bits after each access]

59-61 Tree-based Pseudo-LRU The set now holds E, B, D, C. Which block is the next victim? C. Is it true LRU? What is the advantage of pseudo-LRU? A few bit flips encode recency information at the position of insertion, and a single bit-vector lookup determines the victim. The victim is not always the LRU block, but it is not the MRU block either; more often than not it is close to LRU.
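A minimal sketch of the 3-bit tree-PLRU state for one 4-way set (one possible bit convention; real designs differ in detail), reproducing the example above: after installing A, B, C, D, the victim chosen for E is A's way, and after E is installed the next victim is C's way.

    #include <stdio.h>

    /* Sketch of tree-PLRU for one 4-way set, using 3 bits:
     *   bit0 points at the pair holding the pseudo-LRU block (0 = ways {0,1}, 1 = ways {2,3});
     *   bit1 points at the pseudo-LRU way inside {0,1}; bit2 inside {2,3}.
     * On an access, each bit on the path is set to point AWAY from the accessed way. */
    static unsigned plru;

    static void touch(int way) {
        if (way < 2) {
            plru |= 1u;                               /* LRU pair is now {2,3}          */
            plru = (way == 0) ? (plru | 2u) : (plru & ~2u);
        } else {
            plru &= ~1u;                              /* LRU pair is now {0,1}          */
            plru = (way == 2) ? (plru | 4u) : (plru & ~4u);
        }
    }

    static int victim(void) {
        return (plru & 1u) ? ((plru & 4u) ? 3 : 2)
                           : ((plru & 2u) ? 1 : 0);
    }

    int main(void) {
        for (int w = 0; w < 4; w++) touch(w);          /* install A,B,C,D in ways 0..3   */
        int v = victim();                              /* E evicts this way (way 0 -> A) */
        printf("victim for E: way %d\n", v);
        touch(v);                                      /* E installed and touched        */
        printf("next victim: way %d\n", victim());     /* way 2 -> C, as on the slide    */
        return 0;
    }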

62 Cache design issues

63-66 1 Non-blocking caches Should later cache lookups stall if there is a miss? Blocking can induce a head-of-line stall due to an earlier miss. Observation 1: misses usually happen in bursts; it is helpful to overlap the latencies of multiple parallel misses. Observation 2: a hit can happen while a miss is pending (particularly true for OOO execution). Observation 3: the main memory system can support a large number of in-flight requests. Idea: make caches non-blocking, i.e., the cache keeps accepting new requests while waiting for misses to be handled (how many new requests?). More parallelism in the cache subsystem is important for superscalar and OOO cores.

67-69 Implementing Non-blocking Caches On a miss: send the request to the next level of cache/memory, and put the miss information in a Miss Status Holding Register (MSHR): instruction tag (ROB#), address, load-or-store, store value, ... (Why?) Tag the miss request sent to the later cache/memory with the MSHR entry ID. When the memory response arrives: find the corresponding MSHR entry using the MSHR tag, merge the memory response data with the store value (if a store miss) and write it to the cache, and broadcast the result on the CDB (if a load miss).

70-73 Implementing Non-blocking Caches If a new load/store request targets an already-missing line, the new miss can be merged into the existing MSHR instead of sending another request to the next cache level/main memory; MSHR entries should be big enough to keep info for multiple pending misses to the same line. How many MSHRs? Typically tens; it depends on the targeted bandwidth and the average latency of each miss (remember Little's law from the first lecture?). E.g., 11 at the L1 level in current Intel Xeon (server) processors.
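A back-of-the-envelope sketch of the Little's-law sizing argument (the bandwidth and latency numbers below are illustrative assumptions, not taken from the slides):

    #include <stdio.h>

    /* Sketch: Little's law says (in-flight misses) = (miss throughput) x (miss latency).
     * To sustain the target miss bandwidth, the cache needs at least that many MSHRs. */
    int main(void) {
        double miss_latency_ns = 80.0;   /* assumed average miss latency       */
        double line_bytes      = 64.0;
        double target_gb_per_s = 8.0;    /* assumed miss bandwidth to sustain  */

        double misses_per_ns = (target_gb_per_s * 1e9 / line_bytes) / 1e9;
        double mshrs_needed  = misses_per_ns * miss_latency_ns;
        printf("need about %.0f MSHRs\n", mshrs_needed);  /* ~10 for these numbers */
        return 0;
    }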

74 2 Parallel vs. Serial Caches Tag and Data are usually separate SRAMs: the tag array is smaller & faster, and state bits (valid, LRU bit(s), ...) are stored along with the tags. Parallel access to tag and data reduces latency (good for L1). Serial access to tag and data (read the data only after the tag hits) reduces power (good for L2+). [diagrams: parallel vs. serial tag/data lookup]

75 3 Victim caches Associativity is expensive: performance overhead from extra muxes, and power overhead from reading and checking more tags and data. Conflicts are also expensive: performance lost to extra misses. Observation: conflicts don't occur in all sets. Idea: use a small fully-associative victim cache to absorb blocks displaced from the main cache, extending the associativity of sets that experience many conflict misses.

76-77 Victim cache example Access sequence: E A B N M C K L D J. In a 4-way set-associative L1 cache, every access is a miss: ABCDE map to one set and JKLMN to another, and neither group of five fits in four ways. Adding a small fully-associative victim cache provides a fifth way (and can even provide a 6th or 7th) so long as only a few sets overflow into it at the same time: extra associativity, but not for all sets.

78 4 Cache inclusivity The core often accesses blocks not present in any $. Should a block be allocated in L3, L2, and L1? That is called an inclusive cache hierarchy: it wastes space and requires forced evictions (e.g., force an evict from L1 on an evict from L2+). Why do inclusive caches? Only allocating blocks in L1 is called a non-inclusive hierarchy (why not exclusive?). Some processors combine both: L3 is inclusive of L1 and L2, while L2 is non-inclusive of L1 (like a large victim cache).

79 5 Write propagation When to propagate a new value to (lower-level) memory? Option #1, Write-through: immediately. On a hit, update the cache and immediately send the write to the next level. Option #2, Write-back: when the block is replaced/evicted. Requires an additional dirty bit per block; replacing a clean block causes no extra traffic, while replacing a dirty block causes an extra writeback of the block. What are the trade-offs?

80 Write propagation comparison Write-through: Requires additional bus bandwidth (consider repeated write hits); the next level must handle small writes (1, 2, 4, 8 bytes). + No need for dirty bits in the cache. + No need to handle writeback operations (simplifies miss handling; no WBB). Sometimes used for L1 caches (for example, by IBM). Write-back: + Key advantage: uses less bandwidth. Reverse of the other pros/cons above. Used by Intel and AMD. Second-level caches and beyond are generally write-back.
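A compact sketch (not any particular processor's implementation) of write-hit and eviction handling under the two options, with the dirty bit driving the write-back case:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Sketch: write-hit and eviction handling for write-through vs. write-back.
     * next_level_write() is a stand-in for sending data down the hierarchy. */
    typedef struct { bool valid, dirty; uint8_t data[64]; } line_t;

    static void next_level_write(uint64_t addr, int len) {
        printf("write %d byte(s) to next level at 0x%llx\n", len, (unsigned long long)addr);
    }

    static void write_through_hit(line_t *l, uint64_t addr, int off, uint8_t v) {
        l->data[off] = v;
        next_level_write(addr, 1);            /* every store goes down immediately  */
    }

    static void write_back_hit(line_t *l, int off, uint8_t v) {
        l->data[off] = v;
        l->dirty = true;                      /* defer the traffic until eviction   */
    }

    static void write_back_evict(line_t *l, uint64_t block_addr) {
        if (l->valid && l->dirty)
            next_level_write(block_addr, 64); /* one full-block writeback, only if dirty */
        l->valid = l->dirty = false;
    }

    int main(void) {
        line_t l = { .valid = true };
        write_through_hit(&l, 0x10000, 4, 7); /* traffic on every write             */
        write_back_hit(&l, 4, 7);             /* no traffic yet                     */
        write_back_evict(&l, 0x10000);        /* single writeback at eviction       */
        return 0;
    }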

81 6 Write/store miss handling Should we bring the data into the cache on a write miss? (For the L1 cache, store == write.)

82 Write miss handling How is a write miss actually handled? Write-allocate: fill the block from the next level, then write it. + Decreases read misses (the next read to the block will hit). Requires additional bandwidth. Commonly used (especially with write-back caches). Write-non-allocate: just write to the next level, with no allocation. Potentially more read misses. + Uses less bandwidth. Used with write-through caches.

83-84 Write miss latency Read miss? The load can't go on without the data; it must stall. Write miss? What happens to the store instruction when it reaches the head of the ROB and misses in the cache? It stalls retirement of later instructions. What can we do about it? Remember that stores can be the end of a dependence chain of instructions: is the store's latency really on the critical path?

85-86 Write Misses and Write Buffers Read miss? The load can't go on without the data; it must stall. Write miss? Technically, if no instruction is waiting for the data, why stall? Write buffer (a.k.a. store buffer): a small buffer in front of the L1 D$ for hiding store misses. How does it help? Write buffer vs. writeback buffer: the writeback buffer (WBB) sits behind the L1 D$ for hiding writebacks. Any possible issues? Forwarding again?? [diagram: Processor -> WB -> Cache -> WBB -> next-level cache]

87-88 7 Local vs. Global Miss Rates Local hit/miss rate: percent of references to this cache that hit (e.g., 90%); the local miss rate is 100% minus the local hit rate (e.g., 10%). Global hit/miss rate: misses per instruction (e.g., 1 miss per 30 instructions) or, equivalently, the fraction of instructions that miss (e.g., ~3%); the example assumes loads/stores are 1 in 3 instructions. Consider the second-level cache hit rate: L1: 2 misses per 100 instructions; L2: 1 miss per 100 instructions; so the L2 local miss rate is 50%.
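A tiny worked-numbers sketch of the local vs. global distinction using the figures above:

    #include <stdio.h>

    /* Sketch: local vs. global miss rates for the slide's numbers. */
    int main(void) {
        double l1_misses_per_1000_inst = 20.0;  /* 2 per 100 instructions        */
        double l2_misses_per_1000_inst = 10.0;  /* 1 per 100 instructions        */

        /* L2 is only referenced on L1 misses, so its local miss rate is         */
        double l2_local_miss_rate = l2_misses_per_1000_inst / l1_misses_per_1000_inst;
        /* while its global rate is simply misses per (kilo-)instruction.        */
        double l2_global_mpki = l2_misses_per_1000_inst;

        printf("L2 local miss rate = %.0f%%\n", l2_local_miss_rate * 100.0); /* 50% */
        printf("L2 global MPKI     = %.0f\n", l2_global_mpki);               /* 10  */
        return 0;
    }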

89-90 Techniques We've Seen So Far Use caching to reduce memory latency. Use wide out-of-order execution to hide memory latency by overlapping misses with other useful work, but we cannot efficiently go much wider than several instructions. Neither is enough for server applications: not much locality (mostly accessing linked data structures), not much ILP and MLP, and server apps spend 50-66% of their time stalled on memory. We need a different strategy.

91 Prefetching

92-95 Prefetching (1) Fetch data ahead of demand. Why could it be useful? No caching technique can avoid compulsory misses: caching does not tell you what might be needed next (the replacement policy only predicts what in the cache may not be required). Main challenges: knowing what to fetch (fetching useless blocks wastes resources) and knowing when to fetch (too early pollutes the cache, or the block gets thrown out before use; too late defeats the purpose of *pre*-fetching). Prefetching must be accurate and timely.

96-113 Prefetching (2) [timeline diagrams] Without prefetching: a load checks L1, then L2, then goes to DRAM before the data returns, paying the full load-to-use latency. With prefetching: if the prefetch is issued early enough, the later demand load finds the data already (or nearly) in the cache, giving a much improved load-to-use latency; if the prefetch is issued late, the latency is only somewhat improved. Prefetching must be accurate and timely.

114-116 Types of Prefetching Software: by the compiler, by the programmer. Hardware: Next-Line / Adjacent-Line, Next-N-Line, Stream Buffers, Stride, Localized (PC-based), Pointer, Correlation.

117-120 Software Prefetching (1) Prefetch data using explicit instructions inserted by the compiler and/or programmer. Put the prefetched value into: a register (binding prefetch), also called hoisting, which is the same as reordering instructions (basically just moving the load up in the program); or the cache (non-binding prefetch), which requires ISA support and may get evicted from the cache before the demand access.

121-124 Software Prefetching (2) [example: basic blocks A -> B -> C, with C containing R1 = [R2]; R3 = R1+4] Hoisting the load R1 = [R2] up into block A is prone to many problems: it must be aware of dependences (if A already writes R1, e.g., R1 = R1-1, hoisting violates the WAW ordering); it must not cause exceptions that were not possible in the original execution; and it increases register pressure for the compiler. Using a prefetch instruction instead (PREFETCH [R2] in block A, with the load left in C) avoids all of these problems.

125 Software Prefetching (3) for (i = 1; i < rows; i++) { for (j = 1; j < columns; j++) { prefetch(&x[i+1][j]); sum = sum + x[i][j]; } } The prefetch instruction reads the containing block from memory and puts it in the cache.
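A runnable variant of the loop above; this sketch uses GCC/Clang's __builtin_prefetch in place of the slide's generic prefetch(), and the array shape and one-row-ahead prefetch distance are just the slide's example:

    #include <stdio.h>

    #define ROWS 512
    #define COLS 512

    static double x[ROWS + 1][COLS];   /* +1 row so the i+1 prefetch stays in bounds */

    int main(void) {
        double sum = 0.0;
        for (int i = 1; i < ROWS; i++) {
            for (int j = 1; j < COLS; j++) {
                __builtin_prefetch(&x[i + 1][j]);  /* non-binding prefetch, one row ahead */
                sum += x[i][j];
            }
        }
        printf("sum = %f\n", sum);
        return 0;
    }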

126-127 Software Prefetching (4) Pros: gives the programmer control and flexibility; allows for complex (compiler) analysis; no (major) hardware modifications needed. Cons: prefetch instructions increase the code footprint (may cause more I$ misses and code alignment issues); it is hard to perform timely prefetches (at IPC=2 and a 100-cycle memory latency, the load must move ~200 instructions earlier, and there might not even be 200 instructions in the current function); and prefetching earlier and more often leads to low accuracy, since the program may go down a different path (block B in the previous slides).

128-130 Hardware Prefetching Hardware monitors memory accesses, looks for common patterns, and makes predictions. Predicted addresses are placed into a prefetch queue; the queue is checked when no demand accesses are waiting, and prefetches look like load requests to the memory hierarchy. Prefetchers may trade bandwidth for latency: extra bandwidth is used only when guessing incorrectly, and latency is reduced only when guessing correctly. No need to change software.

131-133 Hardware Prefetcher Design Space What to prefetch? Predict regular patterns (x, x+8, x+16, ...) or correlated patterns (A..B->C, B..C->J, A..C->K, ...). When to prefetch? On every reference (lots of lookup/prefetch overhead), on every miss (patterns are filtered by the caches), or on hits to prefetched data (positive feedback). Where to put prefetched data? Prefetch buffers or the caches themselves.

134-135 Prefetching at Different Levels [diagram: Intel Core2 prefetcher locations in the hierarchy: Registers, I-TLB, L1 I-Cache, L1 D-Cache, D-TLB, L2 Cache, L3 Cache (LLC)] Real CPUs have multiple prefetchers with different strategies, usually placed closer to the core (easier to detect patterns). Prefetching at the LLC is hard. Why? The shared LLC sees mixed access patterns from many cores, and it is typically heavily banked (why is this a problem?).

136-138 Next-Line (or Adjacent-Line) Prefetching On a request for line X, prefetch X+1. Assumes spatial locality (why not just increase the block size?). Should stop at physical (OS) page boundaries (why?). Works for both I$ and D$: instructions execute sequentially, and large data structures often span multiple blocks. Simple, but usually not timely.
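A small sketch of the next-line rule with the page-boundary stop (the 64-byte line and 4 KB page sizes are assumptions for the example):

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES 64u
    #define PAGE_BYTES 4096u

    /* Sketch: given the (physical) address of a demand access, return the address of
     * the next line to prefetch, or 0 if the next line falls in a different page
     * (its translation is unknown, so the prefetch is dropped). */
    static uint64_t next_line_prefetch(uint64_t addr) {
        uint64_t next = (addr & ~(uint64_t)(LINE_BYTES - 1)) + LINE_BYTES;
        if ((next / PAGE_BYTES) != (addr / PAGE_BYTES))
            return 0;                           /* crossed a page boundary: skip  */
        return next;
    }

    int main(void) {
        printf("0x%llx\n", (unsigned long long)next_line_prefetch(0x10F80)); /* 0x10FC0       */
        printf("0x%llx\n", (unsigned long long)next_line_prefetch(0x10FC0)); /* 0: next page  */
        return 0;
    }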

139-141 Next-N-Line Prefetching On a request for line X, prefetch X+1, X+2, ..., X+N. N is called the prefetch depth or prefetch degree, and it must be tuned carefully. A large N is more likely to be timely (always?) but more aggressive, so it is more likely to make a mistake (might evict something useful), more expensive (needs storage for the prefetched lines), and might delay a useful request on the interconnect or a port. Still simple, but more timely than Next-Line.

142-143 Stride Prefetching (1) Access patterns often follow a stride. Example 1: accessing a column of elements in a matrix. Example 2: accessing one field in an array of structs. Detect the stride S and pick a prefetch depth N; prefetch X+S, X+2S, ..., X+NS.

144 Stride Prefetching (2) Must carefully select the depth N (same constraints as the Next-N-Line prefetcher). How to tell the difference between a strided pattern A[i], A[i+1], ... and unrelated accesses X, Y? Wait until you see the same stride a few times; the prefetch depth can vary with confidence (more consecutive strided accesses -> higher confidence). [table: Last Addr = A+2S, Stride = S, Count = 2; a new access to A+3S matches the stride, so the count is updated and, if Count > 2, A+4S is prefetched]

145-148 Localized Stride Prefetchers (1) What if multiple strides are interleaved? E.g., Y = A + X compiled as Load R1 = [R2]; Load R3 = [R4]; Add R5, R1, R3; Store [R6] = R5 produces the address stream A, X, Y, A+S0, X+S1, Y+S2, A+2S0, X+2S1, Y+2S2, ... with no clearly-discernible global stride. Observation: accesses to a given structure usually correlate with an instruction PC. Idea: use an array of strides, indexed by PC.

149 Localized Stride Prefetchers (2) Store the PC, last address, last stride, and a count in a Reference Prediction Table (RPT). On an access, check the RPT: same stride as last time? If yes, count++; if no, count-- or count=0. If confident about the stride (count > C_min), prefetch (last address + stride). [example RPT for the memory instructions at PCs 0x409A34, 0x409A38, 0x409A40: entries (A+3S0, S0, count 2), (X+3S1, S1, count 2), (Y+2S2, S2, count 1); the first entry is confident, so A+4S0 is prefetched]
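A compact sketch of an RPT-style localized stride prefetcher (direct-mapped on the PC, with a simple confidence counter; the table size and threshold are arbitrary choices for illustration):

    #include <stdint.h>
    #include <stdio.h>

    #define RPT_ENTRIES 256
    #define CONF_MIN    2

    /* Sketch: one Reference Prediction Table entry per (hashed) load/store PC. */
    typedef struct {
        uint64_t pc, last_addr;
        int64_t  stride;
        int      count;
    } rpt_entry_t;

    static rpt_entry_t rpt[RPT_ENTRIES];

    /* Called on every memory access; returns a prefetch address, or 0 for none. */
    static uint64_t rpt_access(uint64_t pc, uint64_t addr) {
        rpt_entry_t *e = &rpt[(pc >> 2) % RPT_ENTRIES];
        uint64_t prefetch = 0;

        if (e->pc == pc) {
            int64_t stride = (int64_t)(addr - e->last_addr);
            if (stride == e->stride && stride != 0) {
                if (e->count <= CONF_MIN) e->count++;
                if (e->count >= CONF_MIN)
                    prefetch = addr + (uint64_t)stride;   /* confident: prefetch next */
            } else {
                e->stride = stride;                       /* stride changed: retrain  */
                e->count = 0;
            }
        } else {
            e->pc = pc;                                   /* new PC takes the entry   */
            e->stride = 0;
            e->count = 0;
        }
        e->last_addr = addr;
        return prefetch;
    }

    int main(void) {
        uint64_t pc = 0x409A34;
        for (int i = 0; i < 6; i++) {
            uint64_t addr = 0x10000 + (uint64_t)i * 64;   /* strided stream, S = 64   */
            uint64_t p = rpt_access(pc, addr);
            if (p) printf("access 0x%llx -> prefetch 0x%llx\n",
                          (unsigned long long)addr, (unsigned long long)p);
        }
        return 0;
    }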

150-151 Other Prefetch Patterns Sometimes accesses are highly predictable, but there are no strides: linked data structures (e.g., lists or trees). A linked-list traversal A -> B -> C -> D -> E -> F may touch nodes scattered arbitrarily in memory, so the actual memory layout gives no chance to detect a stride.

152-154 Pointer Prefetching (1) When a cache miss fills a line (512 bits of data), which of the values in it are pointers? Pointers usually look different from other data, so the hardware can guess and go ahead and prefetch the likely-pointer values (this needs some help from the TLB). For example, in struct bintree_node_t { int data1; int data2; struct bintree_node_t *left; struct bintree_node_t *right; }; prefetching the left and right fields lets you walk the tree (or other pointer-based data structures, which are typically hard to prefetch).

155 Pointer Prefetching (2) Relatively cheap to implement: no extra hardware is needed to store patterns, but it can fetch a lot of junk. Limited lookahead makes timely prefetches hard: you can't get the next pointer until the current data block has been fetched, so what is the achievable prefetch depth? [timeline: a stride prefetcher can overlap the accesses to X, X+S, X+2S, while a pointer prefetcher must serialize the accesses to A, B, C]

156 Pair-wise Temporal Correlation (1) Accesses exhibit temporal correlation: if E followed D in the past, then when we see D, prefetch E. Somewhat similar to history-based branch prediction. A correlation table maps each block address to the block that followed it (with a small confidence counter), so a linked-list traversal A B C D E F can be prefetched even though the actual memory layout has no stride. Can be used recursively to get more lookahead.

157 Pair-wise Temporal Correlation (2) Many patterns are more complex than linked lists and can be represented by a Markov model: the correlation table must then track multiple potential successors per block (the number of candidates is called the breadth). Following the table recursively in both breadth and depth grows exponentially (after D, prefetch E? C? F?).
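A minimal sketch of a pair-wise correlation table (one successor per block, direct-mapped; a real design would track multiple successors and confidence as described above):

    #include <stdint.h>
    #include <stdio.h>

    #define TABLE_ENTRIES 1024
    #define LINE_BYTES    64u

    /* Sketch: remember, for each block, the block that followed it last time.
     * On an access, prefetch the recorded successor and update the previous block's entry. */
    typedef struct { uint64_t block, next; } corr_entry_t;

    static corr_entry_t table[TABLE_ENTRIES];
    static uint64_t last_block;

    static uint64_t correlate(uint64_t addr) {
        uint64_t block = addr / LINE_BYTES;
        corr_entry_t *prev = &table[last_block % TABLE_ENTRIES];
        if (prev->block == last_block)
            prev->next = block;                 /* learn: last_block -> block     */
        else
            *prev = (corr_entry_t){ last_block, block };

        corr_entry_t *cur = &table[block % TABLE_ENTRIES];
        uint64_t prediction = (cur->block == block) ? cur->next * LINE_BYTES : 0;
        last_block = block;
        return prediction;                      /* candidate prefetch address (0 = none) */
    }

    int main(void) {
        uint64_t list[] = { 0x7000, 0x3140, 0xA280, 0x10C0 };   /* scattered "nodes" */
        for (int pass = 0; pass < 2; pass++)
            for (int i = 0; i < 4; i++) {
                uint64_t p = correlate(list[i]);
                if (p) printf("pass %d: access 0x%llx -> prefetch 0x%llx\n",
                              pass, (unsigned long long)list[i], (unsigned long long)p);
            }
        return 0;
    }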

158 Increasing Correlation History Length As with branch prediction, a longer history can provide more accuracy (i.e., looking for a sequence of accesses rather than a single one), but it also increases training time. Use a hash of the history for lookup, e.g., XOR the bits of the addresses of the last K accesses. [example: the tree traversal A B D B E B A C F C G C A, where the successor of B depends on what was accessed before it] Better accuracy, larger storage cost.

159 Stream Buffers (1) Used to avoid the cache pollution caused by deep prefetching. Each stream buffer is a FIFO that holds one stream of sequentially prefetched lines, keeping the next N lines available. On a load miss, check the head of all buffers: if it matches, pop the entry from the FIFO and fetch the N+1st line into the buffer; if it misses, allocate a new stream buffer (use LRU for recycling). [diagram: several FIFOs sitting between the cache and the memory interface]

160 Stream Buffers (2) Can incorporate stride prediction mechanisms to support non-unit-stride streams. Can be extended to a quasi-sequential stream buffer: on a request for Y in [X, X+N], advance by Y-X+1, which allows the buffer to work when items are skipped but requires an expensive (associative) comparison.

161 Example: Prefetchers in Intel Sandy Bridge Data L1: 1) PC-localized stride prefetcher. 2) Next-line prefetcher, only on an ascending access to very recently loaded data. L2: 3) Stream buffer/prefetcher: prefetch the cache line that pairs with the current one to make a 128-byte aligned chunk. 4) Stream prefetcher: detects streams of requests made by L1 (I and D) and prefetches lines down the stream; the number of lines to prefetch depends on the number of outstanding requests from L1, and far lines are only prefetched into L3 while closer ones are brought into L2. See the Intel Architectures Optimization Reference Manual for more details.

162 Metrics for prefetching Coverage = (# of cache misses eliminated by prefetching) / (# of total cache misses), where # of total cache misses = # of misses eliminated by prefetching + # of misses not eliminated by prefetching. Accuracy = (# of misses eliminated by prefetching) / (# of useless prefetches + # of misses eliminated by prefetching). Timeliness is a qualitative metric.
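A tiny worked example of the two metrics for some made-up event counts (illustrative numbers only):

    #include <stdio.h>

    /* Sketch: prefetch coverage and accuracy from raw event counts. */
    int main(void) {
        double misses_eliminated  = 600.0;   /* prefetches that removed a demand miss */
        double misses_remaining   = 400.0;   /* demand misses still observed          */
        double useless_prefetches = 300.0;   /* prefetched lines never used           */

        double coverage = misses_eliminated / (misses_eliminated + misses_remaining);
        double accuracy = misses_eliminated / (misses_eliminated + useless_prefetches);

        printf("coverage = %.0f%%\n", coverage * 100.0);   /* 60%              */
        printf("accuracy = %.0f%%\n", accuracy * 100.0);   /* ~66.7%, prints 67 */
        return 0;
    }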


More information

Performance metrics for caches

Performance metrics for caches Performance metrics for caches Basic performance metric: hit ratio h h = Number of memory references that hit in the cache / total number of memory references Typically h = 0.90 to 0.97 Equivalent metric:

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Lecture 11. Virtual Memory Review: Memory Hierarchy

Lecture 11. Virtual Memory Review: Memory Hierarchy Lecture 11 Virtual Memory Review: Memory Hierarchy 1 Administration Homework 4 -Due 12/21 HW 4 Use your favorite language to write a cache simulator. Input: address trace, cache size, block size, associativity

More information

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 5 Memory Hierachy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Memory / DRAM SRAM = Static RAM SRAM vs. DRAM As long as power is present, data is retained DRAM = Dynamic RAM If you don t do anything, you lose the data SRAM: 6T per bit

More information

Memories. CPE480/CS480/EE480, Spring Hank Dietz.

Memories. CPE480/CS480/EE480, Spring Hank Dietz. Memories CPE480/CS480/EE480, Spring 2018 Hank Dietz http://aggregate.org/ee480 What we want, what we have What we want: Unlimited memory space Fast, constant, access time (UMA: Uniform Memory Access) What

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

Course Administration

Course Administration Spring 207 EE 363: Computer Organization Chapter 5: Large and Fast: Exploiting Memory Hierarchy - Avinash Kodi Department of Electrical Engineering & Computer Science Ohio University, Athens, Ohio 4570

More information

Memory Hierarchy I - Cache

Memory Hierarchy I - Cache CSEng120 Computer Architecture Memory Hierarchy I - Cache RHH 1 Cache Application OS Compiler Firmware Instruction Set Architecture CPU I/O Memory Digital Circuits Memory hierarchy concepts Cache organization

More information

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ.

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. Computer Architectures Chapter 5 Tien-Fu Chen National Chung Cheng Univ. Chap5-0 Topics in Memory Hierachy! Memory Hierachy Features: temporal & spatial locality Common: Faster -> more expensive -> smaller!

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Show Me the $... Performance And Caches

Show Me the $... Performance And Caches Show Me the $... Performance And Caches 1 CPU-Cache Interaction (5-stage pipeline) PCen 0x4 Add bubble PC addr inst hit? Primary Instruction Cache IR D To Memory Control Decode, Register Fetch E A B MD1

More information

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2) The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2017 Lecture 15 LAST TIME: CACHE ORGANIZATION Caches have several important parameters B = 2 b bytes to store the block in each cache line S = 2 s cache sets

More information

Lecture 2: Caches. belongs to Milo Martin, Amir Roth, David Wood, James Smith, Mikko Lipasti

Lecture 2: Caches.   belongs to Milo Martin, Amir Roth, David Wood, James Smith, Mikko Lipasti Lecture 2: Caches http://wwwcssfuca/~ashriram/cs885/ belongs to Milo Martin, Amir Roth, David Wood, James Smith, Mikko Lipasti 1 Why focus on caches and memory? CPU can only compute as fast as memory Add

More information

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST Chapter 5 Memory Hierarchy Design In-Cheol Park Dept. of EE, KAIST Why cache? Microprocessor performance increment: 55% per year Memory performance increment: 7% per year Principles of locality Spatial

More information

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance 6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

EN1640: Design of Computing Systems Topic 06: Memory System

EN1640: Design of Computing Systems Topic 06: Memory System EN164: Design of Computing Systems Topic 6: Memory System Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University Spring

More information

Memory hierarchy review. ECE 154B Dmitri Strukov

Memory hierarchy review. ECE 154B Dmitri Strukov Memory hierarchy review ECE 154B Dmitri Strukov Outline Cache motivation Cache basics Six basic optimizations Virtual memory Cache performance Opteron example Processor-DRAM gap in latency Q1. How to deal

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University Large and Fast: Exploiting Memory Hierarchy The Basic of Caches Measuring & Improving Cache Performance Virtual Memory A Common

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Memory Technology. Caches 1. Static RAM (SRAM) Dynamic RAM (DRAM) Magnetic disk. Ideal memory. 0.5ns 2.5ns, $2000 $5000 per GB

Memory Technology. Caches 1. Static RAM (SRAM) Dynamic RAM (DRAM) Magnetic disk. Ideal memory. 0.5ns 2.5ns, $2000 $5000 per GB Memory Technology Caches 1 Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per GB Ideal memory Average access time similar

More information

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find

More information

Spring 2018 :: CSE 502. Main Memory & DRAM. Nima Honarmand

Spring 2018 :: CSE 502. Main Memory & DRAM. Nima Honarmand Main Memory & DRAM Nima Honarmand Main Memory Big Picture 1) Last-level cache sends its memory requests to a Memory Controller Over a system bus of other types of interconnect 2) Memory controller translates

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Lecture 11 Cache. Peng Liu.

Lecture 11 Cache. Peng Liu. Lecture 11 Cache Peng Liu liupeng@zju.edu.cn 1 Associative Cache Example 2 Associative Cache Example 3 Associativity Example Compare 4-block caches Direct mapped, 2-way set associative, fully associative

More information

EITF20: Computer Architecture Part 5.1.1: Virtual Memory

EITF20: Computer Architecture Part 5.1.1: Virtual Memory EITF20: Computer Architecture Part 5.1.1: Virtual Memory Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache optimization Virtual memory Case study AMD Opteron Summary 2 Memory hierarchy 3 Cache

More information

Portland State University ECE 587/687. Caches and Memory-Level Parallelism

Portland State University ECE 587/687. Caches and Memory-Level Parallelism Portland State University ECE 587/687 Caches and Memory-Level Parallelism Revisiting Processor Performance Program Execution Time = (CPU clock cycles + Memory stall cycles) x clock cycle time For each

More information

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored

More information

Mo Money, No Problems: Caches #2...

Mo Money, No Problems: Caches #2... Mo Money, No Problems: Caches #2... 1 Reminder: Cache Terms... Cache: A small and fast memory used to increase the performance of accessing a big and slow memory Uses temporal locality: The tendency to

More information

Caches 3/23/17. Agenda. The Dataflow Model (of a Computer)

Caches 3/23/17. Agenda. The Dataflow Model (of a Computer) Agenda Caches Samira Khan March 23, 2017 Review from last lecture Data flow model Memory hierarchy More Caches The Dataflow Model (of a Computer) Von Neumann model: An instruction is fetched and executed

More information

Fall 2011 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic

Fall 2011 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic Fall 2011 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic Reading: Data prefetch mechanisms, Steven P. Vanderwiel, David J. Lilja, ACM Computing Surveys, Vol. 32, Issue 2 (June 2000) If memory

More information

Agenda. EE 260: Introduction to Digital Design Memory. Naive Register File. Agenda. Memory Arrays: SRAM. Memory Arrays: Register File

Agenda. EE 260: Introduction to Digital Design Memory. Naive Register File. Agenda. Memory Arrays: SRAM. Memory Arrays: Register File EE 260: Introduction to Digital Design Technology Yao Zheng Department of Electrical Engineering University of Hawaiʻi at Mānoa 2 Technology Naive Register File Write Read clk Decoder Read Write 3 4 Arrays:

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

5 Solutions. Solution a. no solution provided. b. no solution provided

5 Solutions. Solution a. no solution provided. b. no solution provided 5 Solutions Solution 5.1 5.1.1 5.1.2 5.1.3 5.1.4 5.1.5 5.1.6 S2 Chapter 5 Solutions Solution 5.2 5.2.1 4 5.2.2 a. I, J b. B[I][0] 5.2.3 a. A[I][J] b. A[J][I] 5.2.4 a. 3596 = 8 800/4 2 8 8/4 + 8000/4 b.

More information

Memory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology

Memory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology Memory Hierarchies Instructor: Dmitri A. Gusev Fall 2007 CS 502: Computers and Communications Technology Lecture 10, October 8, 2007 Memories SRAM: value is stored on a pair of inverting gates very fast

More information

Memory Hierarchy. 2/18/2016 CS 152 Sec6on 5 Colin Schmidt

Memory Hierarchy. 2/18/2016 CS 152 Sec6on 5 Colin Schmidt Memory Hierarchy 2/18/2016 CS 152 Sec6on 5 Colin Schmidt Agenda Review Memory Hierarchy Lab 2 Ques6ons Return Quiz 1 Latencies Comparison Numbers L1 Cache 0.5 ns L2 Cache 7 ns 14x L1 cache Main Memory

More information

Caching Basics. Memory Hierarchies

Caching Basics. Memory Hierarchies Caching Basics CS448 1 Memory Hierarchies Takes advantage of locality of reference principle Most programs do not access all code and data uniformly, but repeat for certain data choices spatial nearby

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information