ECE 4750 Computer Architecture, Fall 2018 T04 Fundamental Memory Microarchitecture


School of Electrical and Computer Engineering
Cornell University

1. Memory Microarchitectural Design Patterns
   1.1. Transactions and Steps
   1.2. Microarchitecture Overview
2. FSM Cache
   2.1. High-Level Idea for FSM Cache
   2.2. FSM Cache Datapath
   2.3. FSM Cache Control Unit
   2.4. Analyzing Performance
3. Pipelined Cache
   3.1. High-Level Idea for Pipelined Cache
   3.2. Pipelined Cache Datapath and Control Unit
   3.3. Analyzing Performance
   3.4. Pipelined Cache with TLB
4. Cache Microarchitecture Optimizations
   4.1. Reduce Hit Time
   4.2. Reduce Miss Rate

   4.3. Reduce Miss Penalty
   4.4. Cache Optimization Summary
5. Case Study: ARM Cortex A8 and Intel Core i7
   5.1. ARM Cortex A8
   5.2. Intel Core i7

Copyright 2018 Christopher Batten, Christina Delimitrou. All rights reserved. This handout was originally prepared by Prof. Christopher Batten at Cornell University for ECE 4750 / CS 4420. It has since been updated by Prof. Christina Delimitrou. Download and use of this handout is permitted for individual educational non-commercial purposes only. Redistribution either in part or in whole via both commercial or non-commercial means requires written permission.

1. Memory Microarchitectural Design Patterns

Time / (Mem Access Sequence) = (Mem Accesses / Sequence) × (Avg Cycles / Mem Access) × (Time / Cycle)

Avg Cycles / Mem Access = (Avg Cycles / Hit) + (Num Misses / Num Accesses) × (Avg Extra Cycles / Miss)

  Microarchitecture       Hit Latency   Extra Accesses for Translation
  FSM Cache               >1            1+
  Pipelined Cache         1             1+
  Pipelined Cache + TLB   1             0

1.1. Transactions and Steps

We can think of each memory access as a transaction. Executing a memory access involves a sequence of steps:

 - Check Tag: check one or more tags in the tag array
 - Select Victim: select a victim line from the cache using the replacement policy
 - Evict Victim: evict the victim line from the cache and write the victim data to memory
 - Refill: refill the requested line by reading the line from memory
 - Write Mem: write the requested word to memory
 - Data Access: read or write the requested word in the data array
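To make the arithmetic concrete, here is a minimal C sketch that evaluates the two equations above; all numbers are hypothetical and just illustrate the calculation.

  #include <stdio.h>

  /* Evaluate the two AMAL equations above with hypothetical numbers. */
  int main(void) {
    double accesses_per_seq   = 128.0;  /* Mem Accesses / Sequence */
    double avg_cycles_per_hit = 1.0;    /* Avg Cycles / Hit        */
    double num_misses         = 16.0;   /* Num Misses              */
    double extra_per_miss     = 4.0;    /* Avg Extra Cycles / Miss */
    double time_per_cycle_ns  = 1.0;    /* Time / Cycle            */

    double avg_cycles_per_access =
        avg_cycles_per_hit + (num_misses / accesses_per_seq) * extra_per_miss;
    double time_per_seq_ns =
        accesses_per_seq * avg_cycles_per_access * time_per_cycle_ns;

    printf("avg cycles per access = %.2f\n", avg_cycles_per_access);
    printf("time per sequence     = %.1f ns\n", time_per_seq_ns);
    return 0;
  }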

Steps for Write-Through with No Write Allocate

Steps for Write-Back with Write Allocate

1.2. Microarchitecture Overview

[Figure: cache microarchitecture overview. The cache contains a datapath and a control unit exchanging control and status signals; it connects to the processor via cachereq/cacheresp, to a memory management unit via mmureq/mmuresp, and to main memory via memreq/memresp. Main memory takes >1 cycle and is combinational.]

2. FSM Cache

Time / (Mem Access Sequence) = (Mem Accesses / Sequence) × (Avg Cycles / Mem Access) × (Time / Cycle)

Avg Cycles / Mem Access = (Avg Cycles / Hit) + (Num Misses / Num Accesses) × (Avg Extra Cycles / Miss)

  Microarchitecture       Hit Latency   Extra Accesses for Translation
  FSM Cache               >1            1+
  Pipelined Cache         1             1+
  Pipelined Cache + TLB   1             0

Assumptions
 - Page-based memory translation, no TLB, physically addressed cache
 - Single-ported combinational SRAMs for tag and data storage
 - Unrealistic combinational main memory
 - Cache requests are 4 B

Configuration
 - Four 16 B cache lines
 - Two-way set-associative
 - Replacement policy: LRU
 - Write policy: write-through, no write allocate

[Figure: the same microarchitecture overview as Section 1.2 — datapath plus control unit, connected to the processor, the memory management unit, and >1-cycle combinational main memory.]
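With this configuration a 32-bit address splits into a 27-bit tag, a 1-bit set index, and a 4-bit line offset (two word-select bits plus two byte-offset bits), matching the field widths in the datapath figures that follow. A minimal C sketch of the decomposition:

  #include <stdio.h>
  #include <stdint.h>

  /* Decompose a 32-bit address for this configuration: four 16 B
     lines, two-way set-associative, so 2 sets. */
  int main(void) {
    uint32_t addr  = 0x1008;              /* example address   */
    uint32_t byte  = addr        & 0x3;   /* 2-bit byte offset */
    uint32_t word  = (addr >> 2) & 0x3;   /* 2-bit word select */
    uint32_t index = (addr >> 4) & 0x1;   /* 1-bit set index   */
    uint32_t tag   = addr >> 5;           /* 27-bit tag        */
    printf("addr 0x%08x -> tag 0x%07x, index %u, word %u, byte %u\n",
           addr, tag, index, word, byte);
    return 0;
  }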

2.1. High-Level Idea for FSM Cache

[Figure: transaction flows for the FSM cache, which processes one transaction at a time. Read hit: Check Tag, Data Access. Write hit: Check Tag, Data Access, Write Mem. Read miss: Check Tag, Select Victim, Refill, Data Access. Write miss: Check Tag, Write Mem.]

2.2. FSM Cache Datapath

As with processors, we design our datapath by incrementally adding support for each transaction and resolving conflicts with muxes.

Implementing READ transactions that hit

[Figure: datapath for read hits. The address splits into a 27-bit tag, 1-bit index, and 4-bit offset; the two tag arrays (way 0 and way 1) are checked, and the selected 128-bit data-array line is muxed down to the requested 32-bit word for cacheresp.]

 - MT: check tag
 - MRD: read data array, return cacheresp

Implementing READ transactions that miss

[Figure: datapath extended for read misses, adding the refill path — victim way selection, the refill memreq address, and the memresp data written into the data array with all word enables set.]

 - MT: check tag
 - MRD: read data array, return cacheresp
 - R0: send refill memreq, get memresp
 - R1: write data array with refill line

Implementing WRITE transactions that miss

[Figure: datapath extended for write misses, adding state MWD and the write memreq path; with no write allocate, a write miss only sends the write to memory.]

Implementing WRITE transactions that hit

[Figure: datapath extended for write hits, adding a replicate unit and per-word write enables so the requested word can also be written into the data array.]

 - MT: check tag
 - MRD: read data array, return cacheresp
 - R0: send refill memreq, get memresp
 - R1: write data array with refill line
 - MWD: send write memreq, write data array
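Putting the four diagrams together, the state transitions can be summarized in a small C sketch; the state names follow the handout, while the C encoding is an assumption.

  /* Sketch of the FSM cache's state transitions, pulled together
     from the datapath diagrams above. */
  typedef enum { MT, MRD, R0, R1, MWD } state_t;

  state_t next_state(state_t s, int req_val, int is_write, int hit) {
    switch (s) {
      case MT:                    /* check tag                         */
        if (!req_val) return MT;  /* no request: wait in MT            */
        if (is_write) return MWD; /* write hit or miss: send memreq    */
        return hit ? MRD : R0;    /* read: hit -> MRD, miss -> refill  */
      case R0:  return R1;        /* send refill memreq, get memresp   */
      case R1:  return MRD;       /* write data array with refill line */
      case MRD: return MT;        /* read data array, return cacheresp */
      case MWD: return MT;        /* send write memreq; on a write hit */
    }                             /*   also write the data array       */
    return MT;
  }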

2.3. FSM Cache Control Unit

[Figure: FSM with states MT, R0, R1, MRD, MWD. From MT: stay in MT if there is no request; go to MRD on a read hit; go to R0, then R1, then MRD on a read miss; go to MWD on a write.]

We will need to keep valid bits in the control unit, with one valid bit for every cache line. We will also need to keep use bits, updated on every cache access, to indicate which was the last used line in each set. Assume we create the following two control signals generated by the FSM control unit:

  hit    = ( ( tag == tarray0[idx] ) && valid0[idx] ) || ( ( tag == tarray1[idx] ) && valid1[idx] )
  victim = !use[idx]

[Table, filled in lecture: for each state (MT, MRD, R0, R1, MWD), the control signals — tarray0 en/wen, tarray1 en/wen, darray en/wen, darray sel, word-enable sel, z4b sel, memreq val/op, cacheresp val.]
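A minimal C sketch of these two signals for the two-set, two-way configuration; the array layout and function name are assumptions beyond the handout's signal definitions.

  #include <stdint.h>

  /* One valid bit per line, one use bit per set, as described above. */
  static uint32_t tarray0[2], tarray1[2];  /* tag arrays, one per way */
  static int      valid0[2],  valid1[2];   /* valid bits, per line    */
  static int      use[2];                  /* last-used way, per set  */

  void gen_signals(uint32_t tag, int idx, int *hit, int *victim) {
    *hit = ((tag == tarray0[idx]) && valid0[idx]) ||
           ((tag == tarray1[idx]) && valid1[idx]);
    *victim = !use[idx];  /* LRU: evict the way not used last */
  }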

2.4. Analyzing Performance

Time / (Mem Access Sequence) = (Mem Accesses / Sequence) × (Avg Cycles / Mem Access) × (Time / Cycle)

Avg Cycles / Mem Access = (Avg Cycles / Hit) + (Num Misses / Num Accesses) × (Avg Extra Cycles / Miss)

Estimating cycle time

  register read/write     1τ
  tag array read/write   10τ
  data array read/write  10τ
  mem read/write         20τ
  decoder                 3τ
  comparator             10τ
  mux                     3τ
  repl unit               0τ
  z4b                     0τ

[Figure: the complete FSM cache datapath, repeated for reference when tracing critical paths.]

Estimating AMAL

Consider the following sequence of memory accesses, which might correspond to copying 4 B elements from a source array to a destination array. Each array contains 64 elements. What is the AMAL?

  rd 0x1000
  wr 0x2000
  rd 0x1004
  wr 0x2004
  rd 0x1008
  wr 0x2008
  ...
  rd 0x1040
  wr 0x2040
  ...

Consider the following sequence of memory accesses, which might correspond to incrementing 4 B elements in an array. The array contains 64 elements. What is the AMAL?

  rd 0x1000
  wr 0x1000
  rd 0x1004
  wr 0x1004
  rd 0x1008
  wr 0x1008
  ...
  rd 0x1040
  wr 0x1040
  ...
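One way to sanity-check an AMAL estimate is to replay the sequence against a tiny model of this section's configuration. The following C sketch classifies each access of the copy loop as a hit or a miss; it does not model cycles, which come from the FSM states above.

  #include <stdio.h>
  #include <stdint.h>

  /* Replay an access sequence against this section's configuration:
     two sets, two ways, 16 B lines, LRU, write-through, no write
     allocate. Only hit/miss classification is modeled. */
  #define NSETS 2
  #define NWAYS 2

  static uint32_t tags[NSETS][NWAYS];
  static int      valid[NSETS][NWAYS];
  static int      lru[NSETS];           /* way to evict next, per set */

  static int cache_access(uint32_t addr, int is_write) {
    uint32_t idx = (addr >> 4) % NSETS, tag = addr >> 5;
    for (int w = 0; w < NWAYS; w++)
      if (valid[idx][w] && tags[idx][w] == tag) {
        lru[idx] = !w;                  /* other way becomes LRU      */
        return 1;                       /* hit                        */
      }
    if (!is_write) {                    /* no write allocate: only    */
      int w = lru[idx];                 /*   read misses refill       */
      tags[idx][w] = tag; valid[idx][w] = 1; lru[idx] = !w;
    }
    return 0;                           /* miss                       */
  }

  int main(void) {
    int hits = 0, n = 0;
    for (uint32_t i = 0; i < 64; i++) { /* the copy loop above        */
      hits += cache_access(0x1000 + 4*i, 0); n++;  /* rd src          */
      hits += cache_access(0x2000 + 4*i, 1); n++;  /* wr dst          */
    }
    printf("%d hits out of %d accesses\n", hits, n);
    return 0;
  }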

3. Pipelined Cache

Time / (Mem Access Sequence) = (Mem Accesses / Sequence) × (Avg Cycles / Mem Access) × (Time / Cycle)

Avg Cycles / Mem Access = (Avg Cycles / Hit) + (Num Misses / Num Accesses) × (Avg Extra Cycles / Miss)

  Microarchitecture       Hit Latency   Extra Accesses for Translation
  FSM Cache               >1            1+
  Pipelined Cache         1             1+
  Pipelined Cache + TLB   1             0

Assumptions
 - Page-based memory translation, no TLB, physically addressed cache
 - Single-ported combinational SRAMs for tag and data storage
 - Unrealistic combinational main memory
 - Cache requests are 4 B

Configuration
 - Four 16 B cache lines
 - Direct-mapped
 - Replacement policy: LRU
 - Write policy: write-through, no write allocate

[Figure: the same microarchitecture overview as Section 1.2.]

3.1. High-Level Idea for Pipelined Cache

[Figure: transaction flows comparing the FSM cache and the pipelined cache. FSM cache — read hit: Check Tag, Data Access; write hit: Check Tag, Data Access, Write Mem; read miss: Check Tag, Select Victim, Refill, Data Access. Pipelined cache — the Check Tag step of one transaction overlaps with the Data Access step of the previous transaction, so a new transaction can start every cycle; misses still take multiple steps (Select Victim, Refill) before the Data Access.]

3.2. Pipelined Cache Datapath and Control Unit

As with processors, we incrementally add support for each transaction, resolving conflicts using muxes.

Implementing READ transactions that hit

[Figure: two-stage datapath. In the M0 stage the address (26-bit tag, 2-bit index, 4-bit offset) is used to check the tag array; in the M1 stage the data array is read and the selected word is returned on cacheresp.]

Implementing WRITE transactions that hit

[Figure: datapath extended for write hits, adding a replicate unit, zero extension, data-array write enables, and the write memreq path.]

Implementing transactions that miss

[Figure: datapath extended for misses. The M0 stage contains an FSM with states R0 and R1 for the refill path — tag array write, refill memreq/memresp, and a data-array write with all word enables set.]

Hybrid pipeline/FSM design pattern
 - The hit path is pipelined with two-cycle latency
 - The miss path stalls in the M0 stage to refill the line

Pipeline diagram for a pipelined cache with 2-cycle hit latency

  rd (hit)
  wr (hit)
  rd (miss)
  rd (miss)
  wr (miss)
  rd (hit)

Parallel read with pipelined write datapath

[Figure: datapath where reads access the tag array and data array in parallel in the M0 stage, while writes check the tag in M0 and write the data array in M1.]

Pipeline diagram for parallel read with pipelined write

  rd (hit)
  rd (hit)
  rd (hit)
  wr (hit)
  wr (hit)
  rd (hit)

 - Achieves single-cycle latency for read hits
 - Two-cycle latency for writes, but is this latency observable?
 - With write acks, send the write ack back in the M0 stage
 - How do we resolve structural hazards?

Resolving structural hazard by exposing it in the ISA

  rd (hit)
  wr (hit)
  nop
  rd (hit)

Resolving structural hazard with hardware stalling

  rd (hit)
  wr (hit)
  rd (hit)

  ostall_m0 = val_m0 && ( type_m0 == RD ) && val_m1 && ( type_m1 == WR )

Resolving structural hazard with hardware duplication

  rd (hit)
  wr (hit)
  rd (hit)

Resolving RAW hazard with software scheduling

Software scheduling: the hazard depends on the memory address, so it is difficult to know at compile time!

Resolving RAW hazard with hardware stalling

  ostall_m0 = ...
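A C sketch of both stall conditions. The structural-hazard term follows the equation above; the address-match term in the RAW condition is an assumption, since the handout leaves that equation as an exercise.

  enum { RD, WR };

  /* Structural hazard: a read in M0 and a write in M1 both need the
     single-ported data array in the same cycle. */
  int ostall_struct(int val_m0, int type_m0, int val_m1, int type_m1) {
    return val_m0 && (type_m0 == RD) && val_m1 && (type_m1 == WR);
  }

  /* RAW hazard: same pattern, but only when the read would observe
     stale data; the address comparison is an assumed refinement. */
  int ostall_raw(int val_m0, int type_m0, unsigned addr_m0,
                 int val_m1, int type_m1, unsigned addr_m1) {
    return val_m0 && (type_m0 == RD) && val_m1 && (type_m1 == WR)
        && (addr_m0 == addr_m1);   /* assumed address-match term */
  }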

Resolving RAW hazard with hardware bypassing

We could use the previous stall signal as our bypass signal, but we will also need a new bypass path in our datapath. Draw this new bypass path on the following datapath diagram.

[Figure: the parallel-read/pipelined-write datapath, repeated so the bypass path from the M1-stage write data into the M0-stage read path can be drawn in.]

Parallel read and pipelined write in set-associative caches

To implement parallel read in a set-associative cache, we must speculatively read a line from each way in parallel with the tag check. This can be expensive in terms of energy, motivating two-cycle hit latencies for highly associative caches.

[Figure: two-way set-associative version of the parallel-read/pipelined-write datapath, with duplicated tag and data arrays and a way-select mux in the M1 stage.]

3.3. Analyzing Performance

Time / (Mem Access Sequence) = (Mem Accesses / Sequence) × (Avg Cycles / Mem Access) × (Time / Cycle)

Avg Cycles / Mem Access = (Avg Cycles / Hit) + (Num Misses / Num Accesses) × (Avg Extra Cycles / Miss)

Estimating cycle time

[Figure: the parallel-read/pipelined-write datapath, repeated for reference when tracing critical paths.]

  register read/write     1τ
  tag array read/write   10τ
  data array read/write  10τ
  mem read/write         20τ
  decoder                 3τ
  comparator             10τ
  mux                     3τ
  repl unit               0τ
  z4b                     0τ

Estimating AMAL

Assume a parallel-read/pipelined-write microarchitecture with hardware duplication to resolve the structural hazard and stalling to resolve RAW hazards. Consider the following sequence of memory accesses, which might correspond to copying 4 B elements from a source array to a destination array. Each array contains 64 elements. What is the AMAL?

  rd 0x1000
  wr 0x2000
  rd 0x1004
  wr 0x2004
  rd 0x1008
  wr 0x2008
  rd 0x100c
  wr 0x200c
  ...

3.4. Pipelined Cache with TLB

How should we integrate an MMU (TLB) into a pipelined cache?

Physically Addressed Caches

Perform memory translation before the cache access:

  Processor → (VA) → TLB → (PA) → Cache → (PA) → Main Memory

Advantages:
 - Physical addresses are unique, so cache entries are unique
 - Updating memory translation simply requires changing the TLB

Disadvantages:
 - Increases hit latency

Virtually Addressed Caches

Perform memory translation after, or in parallel with, the cache access:

  Processor → (VA) → Cache → (VA) → TLB → (PA) → Main Memory

  Processor → (VA) → Cache → (PA) → Main Memory, with the TLB accessed in parallel with the cache

Advantages:
 - Simple one-step process for hits

Disadvantages:
 - Intra-program protection (store protection bits in the cache?)
 - I/O uses physical addresses (map I/O into the virtual address space?)
 - Virtual address homonyms
 - Virtual address synonyms (aliases)

Virtual Address Homonyms

The same virtual address points to two different physical addresses.

[Figure: program 1 and program 2 both use VA0 = 0x3000, but their page tables map it to different physical pages (PA0 = 0x7000 and PA1 = 0x5000). The cache holds a valid line with tag 0x3000 and data 0xcafe.]

Example scenario
 - Program 1 brings VA 0x3000 into the cache
 - Program 1 is context swapped for program 2
 - Program 2 reads VA 0x3000, hits in the cache, but gets incorrect data!

Potential solutions
 - Flush the cache on a context swap
 - Store program IDs (address space IDs, ASIDs) in the cache

Virtual Address Synonyms (Aliases)

Two different virtual addresses point to the same physical address.

[Figure: program 1's VA0 = 0x3000 and program 2's VA1 = 0x1000 both map to physical page PA0 = 0x4000. The cache holds two valid lines — tag 0x3000 with data 0xcafe and tag 0x1000 with data 0xbeef — that is, two stale copies of the same physical location.]

Example scenarios
 - User and OS can potentially have different VAs point to the same PA
 - Memory-mapping the same file (via mmap) in two different programs

Potential solutions
 - Hardware checks all ways (and potentially different sets) on a miss to ensure that a given physical address can only live in one location in the cache
 - Software forces aliases to share some address bits (page coloring), reducing the number of sets we need to check on a miss (or reducing the need to check any other locations for a direct-mapped cache)
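The page-coloring observation can be made concrete in a short C sketch: two aliases can land in different sets only when the cache index extends past the page offset. The cache parameters below are hypothetical (4 KB pages, 8 KB direct-mapped cache with 16 B lines, so one index bit lies above the page offset).

  #include <stdio.h>
  #include <stdint.h>

  #define LINE_OFFSET_BITS 4
  #define INDEX_BITS       9            /* 512 sets */

  static uint32_t set_index(uint32_t va) {
    return (va >> LINE_OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
  }

  int main(void) {
    /* hypothetical aliases of one physical line, differing in
       virtual-address bit 12 (the low bit of the VPN) */
    uint32_t va0 = 0x3000, va1 = 0x2000;
    printf("set(va0) = 0x%x, set(va1) = 0x%x (%s)\n",
           set_index(va0), set_index(va1),
           set_index(va0) == set_index(va1)
               ? "same set"
               : "different sets, so a miss must check both");
    return 0;
  }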

Virtually Indexed and Physically Tagged Caches

[Figure: the virtual page number (v bits) passes through the TLB to produce the physical tag (k bits), while the virtual index (n bits), taken from within the page offset (m bits), indexes a direct-mapped cache of 2^n lines with 2^b-byte lines. The page offset is the same in the VA and the PA.]

 - Up to n bits of the physical address are available without translation
 - If index bits + block-offset bits (n+b) ≤ page-offset bits (m), we can do translation in parallel with reading out the physical tag and data
 - Complete the (physical) tag check once the tag/data access completes
 - With 4 KB pages, a direct-mapped cache must be ≤ 4 KB
 - Larger page sizes (decrease k) or higher associativity (decrease n) enable larger virtually indexed, physically tagged caches
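A quick C sketch of the constraint, using 4 KB pages and a hypothetical direct-mapped cache with 16 B lines:

  #include <stdio.h>

  /* Translation can overlap the cache access when n + b <= m. */
  static int log2i(int x) {
    int n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
  }

  int main(void) {
    int page_bytes = 4096, line_bytes = 16, ways = 1, cache_bytes = 4096;
    int sets = cache_bytes / (line_bytes * ways);
    int n = log2i(sets), b = log2i(line_bytes), m = log2i(page_bytes);
    printf("n+b = %d, m = %d: %s\n", n + b, m,
           (n + b <= m) ? "can translate in parallel"
                        : "need larger pages or higher associativity");
    return 0;
  }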

4. Cache Microarchitecture Optimizations

[Figure: Processor ↔ Cache/MMU ↔ Main Memory, with hit latency on the processor side and miss penalty on the memory side.]

AMAL = Hit Latency + ( Miss Rate × Miss Penalty )

Reduce hit time
 - Small and simple caches

Reduce miss rate
 - Large block size
 - Large cache size
 - High associativity
 - Hardware prefetching
 - Compiler optimizations

Reduce miss penalty
 - Multi-level cache hierarchy
 - Prioritize reads

4.1. Reduce Hit Time

[Figure 2.4 from Hennessy & Patterson: energy per read in nanojoules versus cache size (16 KB to 256 KB) and associativity (1-, 2-, 4-, and 8-way), modeled with CACTI. Energy consumption per read increases as cache size and associativity are increased; the large penalty for eight-way set-associative caches is due to the cost of reading out eight tags and the corresponding data in parallel.]

4.2. Reduce Miss Rate

Large Block Size
 - Less tag overhead
 - Exploits fast burst transfers from DRAM and over wide on-chip busses
 - Can waste bandwidth if the data is not used
 - Fewer blocks means more conflicts

Large Cache Size or High Associativity
 - If the cache size is doubled, the miss rate usually drops by about √2
 - A direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2
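A quick C sketch of that square-root rule of thumb, starting from a hypothetical miss rate:

  #include <stdio.h>
  #include <math.h>

  /* Each doubling of capacity divides the miss rate by about sqrt(2). */
  int main(void) {
    double miss_rate = 0.10;            /* hypothetical, at 16 KB */
    for (int kb = 16; kb <= 256; kb *= 2) {
      printf("%3d KB: miss rate ~ %.3f\n", kb, miss_rate);
      miss_rate /= sqrt(2.0);
    }
    return 0;
  }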

Hardware Prefetching

[Figure: processor, L1 data cache, and a prefetch buffer sitting between the cache and main memory; prefetched cache lines are placed in the buffer.]

 - The previous techniques only help capacity and conflict misses
 - A hardware prefetcher looks for patterns in the miss address stream
 - It attempts to predict what the next miss might be
 - It prefetches this predicted next miss into a prefetch buffer
 - Very effective in reducing compulsory misses for streaming accesses
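A minimal C sketch of the idea, assuming a simple stride detector over the miss address stream; all names are hypothetical, and prefetch_into_buffer() stands in for the real refill path into the prefetch buffer.

  #include <stdio.h>
  #include <stdint.h>

  static void prefetch_into_buffer(uint32_t addr) {
    printf("prefetch 0x%x\n", addr);   /* stands in for a refill */
  }

  static uint32_t last_miss;
  static int32_t  last_stride;

  /* Remember the last miss address, detect a repeating stride, and
     prefetch the predicted next line. */
  static void on_miss(uint32_t miss_addr) {
    int32_t stride = (int32_t)(miss_addr - last_miss);
    if (stride != 0 && stride == last_stride)    /* pattern confirmed   */
      prefetch_into_buffer(miss_addr + stride);  /* predicted next miss */
    last_stride = stride;
    last_miss   = miss_addr;
  }

  int main(void) {
    for (uint32_t a = 0x1000; a < 0x1200; a += 64)  /* streaming misses */
      on_miss(a);
    return 0;
  }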

Compiler Optimizations

Restructuring code affects the block access sequence
 - Group accesses together to improve spatial locality
 - Re-order accesses to improve temporal locality

Prevent data from entering the cache
 - Useful for variables that will only be accessed once before eviction
 - Needs a mechanism for software to tell the hardware not to cache the data (no-allocate instruction hints or page-table bits)

Kill data that will never be used again
 - Streaming data exploits spatial locality but not temporal locality
 - Replace into dead-data locations

Loop Interchange and Fusion

What type of locality does each optimization improve?

Loop interchange:

  for (j = 0; j < N; j++) {
    for (i = 0; i < M; i++) {
      x[i][j] = 2 * x[i][j];
    }
  }

  for (i = 0; i < M; i++) {
    for (j = 0; j < N; j++) {
      x[i][j] = 2 * x[i][j];
    }
  }

What type of locality does this improve?

Loop fusion:

  for (i = 0; i < N; i++)
    a[i] = b[i] * c[i];
  for (i = 0; i < N; i++)
    d[i] = a[i] * c[i];

  for (i = 0; i < N; i++) {
    a[i] = b[i] * c[i];
    d[i] = a[i] * c[i];
  }

What type of locality does this improve?

Matrix Multiply with Naive Code

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
      r = 0;
      for (k = 0; k < N; k++)
        r = r + y[i][k] * z[k][j];
      x[i][j] = r;
    }

[Figure: access patterns for x, y, and z — for each (i,j), row i of y and column j of z are traversed; shading distinguishes not-touched, old, and new accesses.]

Matrix Multiply with Cache Tiling

  for (jj = 0; jj < N; jj = jj + b)
    for (kk = 0; kk < N; kk = kk + b)
      for (i = 0; i < N; i++)
        for (j = jj; j < min(jj + b, N); j++) {
          r = 0;
          for (k = kk; k < min(kk + b, N); k++)
            r = r + y[i][k] * z[k][j];
          x[i][j] = x[i][j] + r;
        }

[Figure: with tiling, accesses to y and z stay within b×b blocks, improving temporal locality.]

4.3. Reduce Miss Penalty

Multi-Level Caches

[Figure: Processor → L1 Cache → L2 Cache → Main Memory, with L1 hits served by L1 and L1 misses that hit in L2 served by L2.]

AMAL = Hit Latency of L1 + ( Miss Rate of L1 × AMAL_L2 )
AMAL_L2 = Hit Latency of L2 + ( Miss Rate of L2 × Miss Penalty of L2 )

 - Local miss rate = misses in this cache / accesses to this cache
 - Global miss rate = misses in this cache / processor memory accesses
 - Misses per instruction = misses in this cache / number of instructions

Use a smaller L1 if there is also an L2
 - Trade increased L1 miss rate for reduced L1 hit time and L1 miss penalty
 - Reduces average access energy

Use a simpler write-through L1 with an on-chip L2
 - The write-back L2 cache absorbs write traffic, so it doesn't go off-chip
 - Simplifies the processor pipeline
 - Simplifies on-chip coherence issues

Inclusive Multilevel Cache
 - The inner cache holds a copy of data in the outer cache
 - External coherence is simpler

Exclusive Multilevel Cache
 - The inner cache may hold data not in the outer cache
 - Swap lines between the inner/outer caches on a miss
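A minimal C sketch evaluating the two equations above with hypothetical latencies and local miss rates:

  #include <stdio.h>

  int main(void) {
    double l1_hit = 1.0,  l1_miss_rate = 0.10;   /* local miss rate */
    double l2_hit = 10.0, l2_miss_rate = 0.20;   /* local miss rate */
    double l2_miss_penalty = 100.0;

    double amal_l2 = l2_hit + l2_miss_rate * l2_miss_penalty;
    double amal    = l1_hit + l1_miss_rate * amal_l2;
    printf("AMAL_L2 = %.1f cycles, AMAL = %.1f cycles\n", amal_l2, amal);
    return 0;
  }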

Prioritize Reads

[Figure: processor writes go into a write buffer between the cache and main memory; read misses bypass the buffered writes on their way to main memory.]

 - The processor is not stalled on writes, and read misses can go ahead of writes to main memory
 - The write buffer may hold the updated value of a location needed by a read miss
 - On a read miss, either wait for the write buffer to be empty, or check the write buffer addresses and bypass the matching data

4.4. Cache Optimization Summary

  Technique                   Hit Lat   Miss Rate   Miss Penalty   BW   HW
  Smaller caches                 +                                      0
  Avoid TLB before indexing      +                                      1
  Large block size                          +                           0
  Large cache size                          +                           1
  High associativity                        +                           1
  Hardware prefetching                      +                           2
  Compiler optimizations                    +                           0
  Multi-level caches                                     +              2
  Prioritize reads                                       +              1
  Pipelining                                                        +

5. Case Study: ARM Cortex A8 and Intel Core i7

5.1. ARM Cortex A8

[Figure 2.16 from Hennessy & Patterson: the virtual address, physical address, indexes, tags, and data blocks for the ARM Cortex-A8 caches and TLB; the instruction and data hierarchies are symmetric, so only one is shown. The TLB (instruction or data) is fully associative with 32 entries. The L1 cache is four-way set-associative with 64-byte blocks and 32 KB capacity. The L2 cache is eight-way set-associative with 64-byte blocks and 1 MB capacity. The figure omits the valid and protection bits for the caches and TLB, as well as the way-prediction bits that would dictate the predicted bank of the L1 cache. It assumes a 16 KB page size.]

L1 cache
 - 32 KB, 64 B lines, 4-way set-associative with random replacement
 - 32K/64 = 512 lines, 128 sets (set index is 7 bits)
 - Virtually indexed, physically tagged, with single-cycle hit latency

Memory management unit
 - TLB with multi-level page tables in physical memory
 - The TLB has 32 entries, fully associative, with a hardware TLB miss handler
 - Variable page size: 4 KB, 16 KB, 64 KB, 1 MB, 16 MB (the figure shows 16 KB)
 - TLB entries can have wildcards to support multiple page sizes

L2 cache
 - 1 MB, 64 B lines, 8-way set-associative
 - 1M/64 = 16K lines, 2K sets (set index is 11 bits)
 - Physically addressed, with multi-cycle hit latency

5.2. Intel Core i7

[Figure 2.19 from Hennessy & Patterson: characteristics of the i7's TLB structure, which has separate first-level instruction and data TLBs, both backed by a joint second-level TLB. The first-level TLBs support the standard 4 KB page size as well as a limited number of entries for large 2 MB to 4 MB pages; only 4 KB pages are supported in the second-level TLB.]

  Characteristic    Instruction TLB   Data TLB     Second-level TLB
  Associativity     4-way             4-way        4-way
  Replacement       Pseudo-LRU        Pseudo-LRU   Pseudo-LRU
  Access latency    1 cycle           1 cycle      6 cycles
  Miss penalty      7 cycles          7 cycles     Hundreds of cycles to access the page table

[Figure 2.20 from Hennessy & Patterson: characteristics of the three-level cache hierarchy in the i7. All caches use write-back and a block size of 64 bytes. The L1 and L2 caches are separate for each core, while the L3 is shared among the cores on a chip and totals 2 MB per core. All three caches are nonblocking and allow multiple outstanding writes. A merging write buffer is used for the L1, which holds data in the event that the line is not present in L1 when it is written; that is, an L1 write miss does not cause the line to be allocated. L3 is inclusive of L1 and L2. Replacement is by a variant on pseudo-LRU; in the case of L3, the block replaced is always the lowest-numbered way whose access bit is turned off.]

  Characteristic       L1                    L2           L3
  Size                 32 KB I / 32 KB D     256 KB       2 MB per core
  Associativity        4-way I / 8-way D     8-way        16-way
  Access latency       4 cycles, pipelined   10 cycles    35 cycles
  Replacement scheme   Pseudo-LRU            Pseudo-LRU   Pseudo-LRU with an ordered selection algorithm

 - 48-bit virtual addresses and 36-bit physical addresses (2^36 B = 64 GB of physical memory)
 - Write-back, with a merging write buffer (more like no write allocate)
 - L3 is inclusive of L1/L2
 - Hardware prefetching from L2 into L1, and from L3 into L2
 - L1 caches are virtually indexed, physically tagged; L2/L3 caches are physically addressed
 - 4 KB pages, except for a few large 2–4 MB pages in the L1 TLBs
