ECE 4750 Computer Architecture, Fall 2018 T04 Fundamental Memory Microarchitecture


School of Electrical and Computer Engineering
Cornell University

1. Memory Microarchitectural Design Patterns
   1.1. Transactions and Steps
   1.2. Microarchitecture Overview
2. FSM Cache
   2.1. High-Level Idea for FSM Cache
   2.2. FSM Cache Datapath
   2.3. FSM Cache Control Unit
   2.4. Analyzing Performance
3. Pipelined Cache
   3.1. High-Level Idea for Pipelined Cache
   3.2. Pipelined Cache Datapath and Control Unit
   3.3. Analyzing Performance
   3.4. Pipelined Cache with TLB
4. Cache Microarchitecture Optimizations
   4.1. Reduce Hit Time
   4.2. Reduce Miss Rate

   4.3. Reduce Miss Penalty
   4.4. Cache Optimization Summary
5. Case Study: ARM Cortex A8 and Intel Core i7
   5.1. ARM Cortex A8
   5.2. Intel Core i7

Copyright 2018 Christopher Batten, Christina Delimitrou. All rights reserved. This handout was originally prepared by Prof. Christopher Batten at Cornell University for ECE 4750 / CS 4420. It has since been updated by Prof. Christina Delimitrou. Download and use of this handout is permitted for individual educational non-commercial purposes only. Redistribution either in part or in whole via both commercial or non-commercial means requires written permission.

1. Memory Microarchitectural Design Patterns

Time / (Mem Access Sequence) = (Mem Accesses / Sequence) × (Avg Cycles / Mem Access) × (Time / Cycle)

Avg Cycles / Mem Access = (Avg Cycles / Hit) + (Num Misses / Num Accesses) × (Avg Extra Cycles / Miss)

  Microarchitecture       Hit Latency   Extra Accesses for Translation
  FSM Cache               >1            1+
  Pipelined Cache         1             1+
  Pipelined Cache + TLB   1             0

1.1. Transactions and Steps

We can think of each memory access as a transaction. Executing a memory access involves a sequence of steps:

 - Check Tag: check one or more tags in the tag array
 - Select Victim: select a victim line from the cache using the replacement policy
 - Evict Victim: evict the victim line from the cache and write the victim data to memory
 - Refill: refill the requested line by reading the line from memory
 - Write Mem: write the requested word to memory
 - Data Access: read or write the requested word in the data array
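To make the arithmetic concrete, here is a minimal C sketch that evaluates the two equations above; all numbers are hypothetical and just illustrate the calculation.

  #include <stdio.h>

  /* Evaluate the two AMAL equations above with hypothetical numbers. */
  int main(void) {
    double accesses_per_seq   = 128.0;  /* Mem Accesses / Sequence */
    double avg_cycles_per_hit = 1.0;    /* Avg Cycles / Hit        */
    double num_misses         = 16.0;   /* Num Misses              */
    double extra_per_miss     = 4.0;    /* Avg Extra Cycles / Miss */
    double time_per_cycle_ns  = 1.0;    /* Time / Cycle            */

    double avg_cycles_per_access =
        avg_cycles_per_hit + (num_misses / accesses_per_seq) * extra_per_miss;
    double time_per_seq_ns =
        accesses_per_seq * avg_cycles_per_access * time_per_cycle_ns;

    printf("avg cycles per access = %.2f\n", avg_cycles_per_access);
    printf("time per sequence     = %.1f ns\n", time_per_seq_ns);
    return 0;
  }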

Steps for Write-Through with No Write Allocate

Steps for Write-Back with Write Allocate

1.2. Microarchitecture Overview

[Figure: cache microarchitecture overview. The cache contains a datapath and a control unit exchanging control and status signals; it connects to the processor via cachereq/cacheresp, to a memory management unit via mmureq/mmuresp, and to main memory via memreq/memresp. Main memory takes >1 cycle and is combinational.]

2. FSM Cache

Time / (Mem Access Sequence) = (Mem Accesses / Sequence) × (Avg Cycles / Mem Access) × (Time / Cycle)

Avg Cycles / Mem Access = (Avg Cycles / Hit) + (Num Misses / Num Accesses) × (Avg Extra Cycles / Miss)

  Microarchitecture       Hit Latency   Extra Accesses for Translation
  FSM Cache               >1            1+
  Pipelined Cache         1             1+
  Pipelined Cache + TLB   1             0

Assumptions
 - Page-based memory translation, no TLB, physically addressed cache
 - Single-ported combinational SRAMs for tag and data storage
 - Unrealistic combinational main memory
 - Cache requests are 4 B

Configuration
 - Four 16 B cache lines
 - Two-way set-associative
 - Replacement policy: LRU
 - Write policy: write-through, no write allocate

[Figure: the same microarchitecture overview as Section 1.2 — datapath plus control unit, connected to the processor, the memory management unit, and >1-cycle combinational main memory.]
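With this configuration a 32-bit address splits into a 27-bit tag, a 1-bit set index, and a 4-bit line offset (two word-select bits plus two byte-offset bits), matching the field widths in the datapath figures that follow. A minimal C sketch of the decomposition:

  #include <stdio.h>
  #include <stdint.h>

  /* Decompose a 32-bit address for this configuration: four 16 B
     lines, two-way set-associative, so 2 sets. */
  int main(void) {
    uint32_t addr  = 0x1008;              /* example address   */
    uint32_t byte  = addr        & 0x3;   /* 2-bit byte offset */
    uint32_t word  = (addr >> 2) & 0x3;   /* 2-bit word select */
    uint32_t index = (addr >> 4) & 0x1;   /* 1-bit set index   */
    uint32_t tag   = addr >> 5;           /* 27-bit tag        */
    printf("addr 0x%08x -> tag 0x%07x, index %u, word %u, byte %u\n",
           addr, tag, index, word, byte);
    return 0;
  }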

2.1. High-Level Idea for FSM Cache

[Figure: transaction flows for the FSM cache, which processes one transaction at a time. Read hit: Check Tag, Data Access. Write hit: Check Tag, Data Access, Write Mem. Read miss: Check Tag, Select Victim, Refill, Data Access. Write miss: Check Tag, Write Mem.]

2.2. FSM Cache Datapath

As with processors, we design our datapath by incrementally adding support for each transaction and resolving conflicts with muxes.

Implementing READ transactions that hit

[Figure: datapath for read hits. The address splits into a 27-bit tag, 1-bit index, and 4-bit offset; the two tag arrays (way 0 and way 1) are checked, and the selected 128-bit data-array line is muxed down to the requested 32-bit word for cacheresp.]

 - MT: check tag
 - MRD: read data array, return cacheresp

Implementing READ transactions that miss

[Figure: datapath extended for read misses, adding the refill path — victim way selection, the refill memreq address, and the memresp data written into the data array with all word enables set.]

 - MT: check tag
 - MRD: read data array, return cacheresp
 - R0: send refill memreq, get memresp
 - R1: write data array with refill line

Implementing WRITE transactions that miss

[Figure: datapath extended for write misses, adding state MWD and the write memreq path; with no write allocate, a write miss only sends the write to memory.]

Implementing WRITE transactions that hit

[Figure: datapath extended for write hits, adding a replicate unit and per-word write enables so the requested word can also be written into the data array.]

 - MT: check tag
 - MRD: read data array, return cacheresp
 - R0: send refill memreq, get memresp
 - R1: write data array with refill line
 - MWD: send write memreq, write data array
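Putting the four diagrams together, the state transitions can be summarized in a small C sketch; the state names follow the handout, while the C encoding is an assumption.

  /* Sketch of the FSM cache's state transitions, pulled together
     from the datapath diagrams above. */
  typedef enum { MT, MRD, R0, R1, MWD } state_t;

  state_t next_state(state_t s, int req_val, int is_write, int hit) {
    switch (s) {
      case MT:                    /* check tag                         */
        if (!req_val) return MT;  /* no request: wait in MT            */
        if (is_write) return MWD; /* write hit or miss: send memreq    */
        return hit ? MRD : R0;    /* read: hit -> MRD, miss -> refill  */
      case R0:  return R1;        /* send refill memreq, get memresp   */
      case R1:  return MRD;       /* write data array with refill line */
      case MRD: return MT;        /* read data array, return cacheresp */
      case MWD: return MT;        /* send write memreq; on a write hit */
    }                             /*   also write the data array       */
    return MT;
  }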

2.3. FSM Cache Control Unit

[Figure: FSM with states MT, R0, R1, MRD, MWD. From MT: stay in MT if there is no request; go to MRD on a read hit; go to R0, then R1, then MRD on a read miss; go to MWD on a write.]

We will need to keep valid bits in the control unit, with one valid bit for every cache line. We will also need to keep use bits, updated on every cache access, to indicate which was the last used line in each set. Assume we create the following two control signals generated by the FSM control unit:

  hit    = ( ( tag == tarray0[idx] ) && valid0[idx] ) || ( ( tag == tarray1[idx] ) && valid1[idx] )
  victim = !use[idx]

[Table, filled in lecture: for each state (MT, MRD, R0, R1, MWD), the control signals — tarray0 en/wen, tarray1 en/wen, darray en/wen, darray sel, word-enable sel, z4b sel, memreq val/op, cacheresp val.]
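A minimal C sketch of these two signals for the two-set, two-way configuration; the array layout and function name are assumptions beyond the handout's signal definitions.

  #include <stdint.h>

  /* One valid bit per line, one use bit per set, as described above. */
  static uint32_t tarray0[2], tarray1[2];  /* tag arrays, one per way */
  static int      valid0[2],  valid1[2];   /* valid bits, per line    */
  static int      use[2];                  /* last-used way, per set  */

  void gen_signals(uint32_t tag, int idx, int *hit, int *victim) {
    *hit = ((tag == tarray0[idx]) && valid0[idx]) ||
           ((tag == tarray1[idx]) && valid1[idx]);
    *victim = !use[idx];  /* LRU: evict the way not used last */
  }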

2.4. Analyzing Performance

Time / (Mem Access Sequence) = (Mem Accesses / Sequence) × (Avg Cycles / Mem Access) × (Time / Cycle)

Avg Cycles / Mem Access = (Avg Cycles / Hit) + (Num Misses / Num Accesses) × (Avg Extra Cycles / Miss)

Estimating cycle time

  register read/write     1τ
  tag array read/write   10τ
  data array read/write  10τ
  mem read/write         20τ
  decoder                 3τ
  comparator             10τ
  mux                     3τ
  repl unit               0τ
  z4b                     0τ

[Figure: the complete FSM cache datapath, repeated for reference when tracing critical paths.]

Estimating AMAL

Consider the following sequence of memory accesses, which might correspond to copying 4 B elements from a source array to a destination array. Each array contains 64 elements. What is the AMAL?

  rd 0x1000
  wr 0x2000
  rd 0x1004
  wr 0x2004
  rd 0x1008
  wr 0x2008
  ...
  rd 0x1040
  wr 0x2040
  ...

Consider the following sequence of memory accesses, which might correspond to incrementing 4 B elements in an array. The array contains 64 elements. What is the AMAL?

  rd 0x1000
  wr 0x1000
  rd 0x1004
  wr 0x1004
  rd 0x1008
  wr 0x1008
  ...
  rd 0x1040
  wr 0x1040
  ...
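One way to sanity-check an AMAL estimate is to replay the sequence against a tiny model of this section's configuration. The following C sketch classifies each access of the copy loop as a hit or a miss; it does not model cycles, which come from the FSM states above.

  #include <stdio.h>
  #include <stdint.h>

  /* Replay an access sequence against this section's configuration:
     two sets, two ways, 16 B lines, LRU, write-through, no write
     allocate. Only hit/miss classification is modeled. */
  #define NSETS 2
  #define NWAYS 2

  static uint32_t tags[NSETS][NWAYS];
  static int      valid[NSETS][NWAYS];
  static int      lru[NSETS];           /* way to evict next, per set */

  static int cache_access(uint32_t addr, int is_write) {
    uint32_t idx = (addr >> 4) % NSETS, tag = addr >> 5;
    for (int w = 0; w < NWAYS; w++)
      if (valid[idx][w] && tags[idx][w] == tag) {
        lru[idx] = !w;                  /* other way becomes LRU      */
        return 1;                       /* hit                        */
      }
    if (!is_write) {                    /* no write allocate: only    */
      int w = lru[idx];                 /*   read misses refill       */
      tags[idx][w] = tag; valid[idx][w] = 1; lru[idx] = !w;
    }
    return 0;                           /* miss                       */
  }

  int main(void) {
    int hits = 0, n = 0;
    for (uint32_t i = 0; i < 64; i++) { /* the copy loop above        */
      hits += cache_access(0x1000 + 4*i, 0); n++;  /* rd src          */
      hits += cache_access(0x2000 + 4*i, 1); n++;  /* wr dst          */
    }
    printf("%d hits out of %d accesses\n", hits, n);
    return 0;
  }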

3. Pipelined Cache

Time / (Mem Access Sequence) = (Mem Accesses / Sequence) × (Avg Cycles / Mem Access) × (Time / Cycle)

Avg Cycles / Mem Access = (Avg Cycles / Hit) + (Num Misses / Num Accesses) × (Avg Extra Cycles / Miss)

  Microarchitecture       Hit Latency   Extra Accesses for Translation
  FSM Cache               >1            1+
  Pipelined Cache         1             1+
  Pipelined Cache + TLB   1             0

Assumptions
 - Page-based memory translation, no TLB, physically addressed cache
 - Single-ported combinational SRAMs for tag and data storage
 - Unrealistic combinational main memory
 - Cache requests are 4 B

Configuration
 - Four 16 B cache lines
 - Direct-mapped
 - Replacement policy: LRU
 - Write policy: write-through, no write allocate

[Figure: the same microarchitecture overview as Section 1.2.]

3.1. High-Level Idea for Pipelined Cache

[Figure: transaction flows comparing the FSM cache and the pipelined cache. FSM cache — read hit: Check Tag, Data Access; write hit: Check Tag, Data Access, Write Mem; read miss: Check Tag, Select Victim, Refill, Data Access. Pipelined cache — the Check Tag step of one transaction overlaps with the Data Access step of the previous transaction, so a new transaction can start every cycle; misses still take multiple steps (Select Victim, Refill) before the Data Access.]

3.2. Pipelined Cache Datapath and Control Unit

As with processors, we incrementally add support for each transaction, resolving conflicts using muxes.

Implementing READ transactions that hit

[Figure: two-stage datapath. In the M0 stage the address (26-bit tag, 2-bit index, 4-bit offset) is used to check the tag array; in the M1 stage the data array is read and the selected word is returned on cacheresp.]

Implementing WRITE transactions that hit

[Figure: datapath extended for write hits, adding a replicate unit, zero extension, data-array write enables, and the write memreq path.]

Implementing transactions that miss

[Figure: datapath extended for misses. The M0 stage contains an FSM with states R0 and R1 for the refill path — tag array write, refill memreq/memresp, and a data-array write with all word enables set.]

Hybrid pipeline/FSM design pattern
 - The hit path is pipelined with two-cycle latency
 - The miss path stalls in the M0 stage to refill the line

Pipeline diagram for a pipelined cache with 2-cycle hit latency

  rd (hit)
  wr (hit)
  rd (miss)
  rd (miss)
  wr (miss)
  rd (hit)

Parallel read with pipelined write datapath

[Figure: datapath where reads access the tag array and data array in parallel in the M0 stage, while writes check the tag in M0 and write the data array in M1.]

Pipeline diagram for parallel read with pipelined write

  rd (hit)
  rd (hit)
  rd (hit)
  wr (hit)
  wr (hit)
  rd (hit)

 - Achieves single-cycle latency for read hits
 - Two-cycle latency for writes, but is this latency observable?
 - With write acks, send the write ack back in the M0 stage
 - How do we resolve structural hazards?

Resolving structural hazard by exposing it in the ISA

  rd (hit)
  wr (hit)
  nop
  rd (hit)

Resolving structural hazard with hardware stalling

  rd (hit)
  wr (hit)
  rd (hit)

  ostall_m0 = val_m0 && ( type_m0 == RD ) && val_m1 && ( type_m1 == WR )

Resolving structural hazard with hardware duplication

  rd (hit)
  wr (hit)
  rd (hit)

Resolving RAW hazard with software scheduling

Software scheduling: the hazard depends on the memory address, so it is difficult to know at compile time!

Resolving RAW hazard with hardware stalling

  ostall_m0 = ...
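A C sketch of both stall conditions. The structural-hazard term follows the equation above; the address-match term in the RAW condition is an assumption, since the handout leaves that equation as an exercise.

  enum { RD, WR };

  /* Structural hazard: a read in M0 and a write in M1 both need the
     single-ported data array in the same cycle. */
  int ostall_struct(int val_m0, int type_m0, int val_m1, int type_m1) {
    return val_m0 && (type_m0 == RD) && val_m1 && (type_m1 == WR);
  }

  /* RAW hazard: same pattern, but only when the read would observe
     stale data; the address comparison is an assumed refinement. */
  int ostall_raw(int val_m0, int type_m0, unsigned addr_m0,
                 int val_m1, int type_m1, unsigned addr_m1) {
    return val_m0 && (type_m0 == RD) && val_m1 && (type_m1 == WR)
        && (addr_m0 == addr_m1);   /* assumed address-match term */
  }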

Resolving RAW hazard with hardware bypassing

We could use the previous stall signal as our bypass signal, but we will also need a new bypass path in our datapath. Draw this new bypass path on the following datapath diagram.

[Figure: the parallel-read/pipelined-write datapath, repeated so the bypass path from the M1-stage write data into the M0-stage read path can be drawn in.]

Parallel read and pipelined write in set-associative caches

To implement parallel read in a set-associative cache, we must speculatively read a line from each way in parallel with the tag check. This can be expensive in terms of energy, motivating two-cycle hit latencies for highly associative caches.

[Figure: two-way set-associative version of the parallel-read/pipelined-write datapath, with duplicated tag and data arrays and a way-select mux in the M1 stage.]

3.3. Analyzing Performance

Time / (Mem Access Sequence) = (Mem Accesses / Sequence) × (Avg Cycles / Mem Access) × (Time / Cycle)

Avg Cycles / Mem Access = (Avg Cycles / Hit) + (Num Misses / Num Accesses) × (Avg Extra Cycles / Miss)

Estimating cycle time

[Figure: the parallel-read/pipelined-write datapath, repeated for reference when tracing critical paths.]

  register read/write     1τ
  tag array read/write   10τ
  data array read/write  10τ
  mem read/write         20τ
  decoder                 3τ
  comparator             10τ
  mux                     3τ
  repl unit               0τ
  z4b                     0τ

Estimating AMAL

Assume a parallel-read/pipelined-write microarchitecture with hardware duplication to resolve the structural hazard and stalling to resolve RAW hazards. Consider the following sequence of memory accesses, which might correspond to copying 4 B elements from a source array to a destination array. Each array contains 64 elements. What is the AMAL?

  rd 0x1000
  wr 0x2000
  rd 0x1004
  wr 0x2004
  rd 0x1008
  wr 0x2008
  rd 0x100c
  wr 0x200c
  ...

3.4. Pipelined Cache with TLB

How should we integrate an MMU (TLB) into a pipelined cache?

Physically Addressed Caches

Perform memory translation before the cache access:

  Processor → (VA) → TLB → (PA) → Cache → (PA) → Main Memory

Advantages:
 - Physical addresses are unique, so cache entries are unique
 - Updating memory translation simply requires changing the TLB

Disadvantages:
 - Increases hit latency

Virtually Addressed Caches

Perform memory translation after, or in parallel with, the cache access:

  Processor → (VA) → Cache → (VA) → TLB → (PA) → Main Memory

  Processor → (VA) → Cache → (PA) → Main Memory, with the TLB accessed in parallel with the cache

Advantages:
 - Simple one-step process for hits

Disadvantages:
 - Intra-program protection (store protection bits in the cache?)
 - I/O uses physical addresses (map I/O into the virtual address space?)
 - Virtual address homonyms
 - Virtual address synonyms (aliases)

Virtual Address Homonyms

The same virtual address points to two different physical addresses.

[Figure: program 1 and program 2 both use VA0 = 0x3000, but their page tables map it to different physical pages (PA0 = 0x7000 and PA1 = 0x5000). The cache holds a valid line with tag 0x3000 and data 0xcafe.]

Example scenario
 - Program 1 brings VA 0x3000 into the cache
 - Program 1 is context swapped for program 2
 - Program 2 reads VA 0x3000, hits in the cache, but gets incorrect data!

Potential solutions
 - Flush the cache on a context swap
 - Store program IDs (address space IDs, ASIDs) in the cache

Virtual Address Synonyms (Aliases)

Two different virtual addresses point to the same physical address.

[Figure: program 1's VA0 = 0x3000 and program 2's VA1 = 0x1000 both map to physical page PA0 = 0x4000. The cache holds two valid lines — tag 0x3000 with data 0xcafe and tag 0x1000 with data 0xbeef — that is, two stale copies of the same physical location.]

Example scenarios
 - User and OS can potentially have different VAs point to the same PA
 - Memory-mapping the same file (via mmap) in two different programs

Potential solutions
 - Hardware checks all ways (and potentially different sets) on a miss to ensure that a given physical address can only live in one location in the cache
 - Software forces aliases to share some address bits (page coloring), reducing the number of sets we need to check on a miss (or reducing the need to check any other locations for a direct-mapped cache)
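The page-coloring observation can be made concrete in a short C sketch: two aliases can land in different sets only when the cache index extends past the page offset. The cache parameters below are hypothetical (4 KB pages, 8 KB direct-mapped cache with 16 B lines, so one index bit lies above the page offset).

  #include <stdio.h>
  #include <stdint.h>

  #define LINE_OFFSET_BITS 4
  #define INDEX_BITS       9            /* 512 sets */

  static uint32_t set_index(uint32_t va) {
    return (va >> LINE_OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
  }

  int main(void) {
    /* hypothetical aliases of one physical line, differing in
       virtual-address bit 12 (the low bit of the VPN) */
    uint32_t va0 = 0x3000, va1 = 0x2000;
    printf("set(va0) = 0x%x, set(va1) = 0x%x (%s)\n",
           set_index(va0), set_index(va1),
           set_index(va0) == set_index(va1)
               ? "same set"
               : "different sets, so a miss must check both");
    return 0;
  }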

Virtually Indexed and Physically Tagged Caches

[Figure: the virtual page number (v bits) passes through the TLB to produce the physical tag (k bits), while the virtual index (n bits), taken from within the page offset (m bits), indexes a direct-mapped cache of 2^n lines with 2^b-byte lines. The page offset is the same in the VA and the PA.]

 - Up to n bits of the physical address are available without translation
 - If index bits + block-offset bits (n+b) ≤ page-offset bits (m), we can do translation in parallel with reading out the physical tag and data
 - Complete the (physical) tag check once the tag/data access completes
 - With 4 KB pages, a direct-mapped cache must be ≤ 4 KB
 - Larger page sizes (decrease k) or higher associativity (decrease n) enable larger virtually indexed, physically tagged caches
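A quick C sketch of the constraint, using 4 KB pages and a hypothetical direct-mapped cache with 16 B lines:

  #include <stdio.h>

  /* Translation can overlap the cache access when n + b <= m. */
  static int log2i(int x) {
    int n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
  }

  int main(void) {
    int page_bytes = 4096, line_bytes = 16, ways = 1, cache_bytes = 4096;
    int sets = cache_bytes / (line_bytes * ways);
    int n = log2i(sets), b = log2i(line_bytes), m = log2i(page_bytes);
    printf("n+b = %d, m = %d: %s\n", n + b, m,
           (n + b <= m) ? "can translate in parallel"
                        : "need larger pages or higher associativity");
    return 0;
  }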

4. Cache Microarchitecture Optimizations

[Figure: Processor ↔ Cache/MMU ↔ Main Memory, with hit latency on the processor side and miss penalty on the memory side.]

AMAL = Hit Latency + ( Miss Rate × Miss Penalty )

Reduce hit time
 - Small and simple caches

Reduce miss rate
 - Large block size
 - Large cache size
 - High associativity
 - Hardware prefetching
 - Compiler optimizations

Reduce miss penalty
 - Multi-level cache hierarchy
 - Prioritize reads

4.1. Reduce Hit Time

[Figure 2.4 from Hennessy & Patterson: energy per read in nanojoules versus cache size (16 KB to 256 KB) and associativity (1-, 2-, 4-, and 8-way), modeled with CACTI. Energy consumption per read increases as cache size and associativity are increased; the large penalty for eight-way set-associative caches is due to the cost of reading out eight tags and the corresponding data in parallel.]

4.2. Reduce Miss Rate

Large Block Size
 - Less tag overhead
 - Exploits fast burst transfers from DRAM and over wide on-chip busses
 - Can waste bandwidth if the data is not used
 - Fewer blocks means more conflicts

Large Cache Size or High Associativity
 - If the cache size is doubled, the miss rate usually drops by about √2
 - A direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2
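A quick C sketch of that square-root rule of thumb, starting from a hypothetical miss rate:

  #include <stdio.h>
  #include <math.h>

  /* Each doubling of capacity divides the miss rate by about sqrt(2). */
  int main(void) {
    double miss_rate = 0.10;            /* hypothetical, at 16 KB */
    for (int kb = 16; kb <= 256; kb *= 2) {
      printf("%3d KB: miss rate ~ %.3f\n", kb, miss_rate);
      miss_rate /= sqrt(2.0);
    }
    return 0;
  }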

Hardware Prefetching

[Figure: processor, L1 data cache, and a prefetch buffer sitting between the cache and main memory; prefetched cache lines are placed in the buffer.]

 - The previous techniques only help capacity and conflict misses
 - A hardware prefetcher looks for patterns in the miss address stream
 - It attempts to predict what the next miss might be
 - It prefetches this predicted next miss into a prefetch buffer
 - Very effective in reducing compulsory misses for streaming accesses
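A minimal C sketch of the idea, assuming a simple stride detector over the miss address stream; all names are hypothetical, and prefetch_into_buffer() stands in for the real refill path into the prefetch buffer.

  #include <stdio.h>
  #include <stdint.h>

  static void prefetch_into_buffer(uint32_t addr) {
    printf("prefetch 0x%x\n", addr);   /* stands in for a refill */
  }

  static uint32_t last_miss;
  static int32_t  last_stride;

  /* Remember the last miss address, detect a repeating stride, and
     prefetch the predicted next line. */
  static void on_miss(uint32_t miss_addr) {
    int32_t stride = (int32_t)(miss_addr - last_miss);
    if (stride != 0 && stride == last_stride)    /* pattern confirmed   */
      prefetch_into_buffer(miss_addr + stride);  /* predicted next miss */
    last_stride = stride;
    last_miss   = miss_addr;
  }

  int main(void) {
    for (uint32_t a = 0x1000; a < 0x1200; a += 64)  /* streaming misses */
      on_miss(a);
    return 0;
  }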

Compiler Optimizations

Restructuring code affects the block access sequence
 - Group accesses together to improve spatial locality
 - Re-order accesses to improve temporal locality

Prevent data from entering the cache
 - Useful for variables that will only be accessed once before eviction
 - Needs a mechanism for software to tell the hardware not to cache the data (no-allocate instruction hints or page-table bits)

Kill data that will never be used again
 - Streaming data exploits spatial locality but not temporal locality
 - Replace into dead-data locations

Loop Interchange and Fusion

What type of locality does each optimization improve?

Loop interchange:

  for (j = 0; j < N; j++) {
    for (i = 0; i < M; i++) {
      x[i][j] = 2 * x[i][j];
    }
  }

  for (i = 0; i < M; i++) {
    for (j = 0; j < N; j++) {
      x[i][j] = 2 * x[i][j];
    }
  }

What type of locality does this improve?

Loop fusion:

  for (i = 0; i < N; i++)
    a[i] = b[i] * c[i];
  for (i = 0; i < N; i++)
    d[i] = a[i] * c[i];

  for (i = 0; i < N; i++) {
    a[i] = b[i] * c[i];
    d[i] = a[i] * c[i];
  }

What type of locality does this improve?

Matrix Multiply with Naive Code

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
      r = 0;
      for (k = 0; k < N; k++)
        r = r + y[i][k] * z[k][j];
      x[i][j] = r;
    }

[Figure: access patterns for x, y, and z — for each (i,j), row i of y and column j of z are traversed; shading distinguishes not-touched, old, and new accesses.]

Matrix Multiply with Cache Tiling

  for (jj = 0; jj < N; jj = jj + b)
    for (kk = 0; kk < N; kk = kk + b)
      for (i = 0; i < N; i++)
        for (j = jj; j < min(jj + b, N); j++) {
          r = 0;
          for (k = kk; k < min(kk + b, N); k++)
            r = r + y[i][k] * z[k][j];
          x[i][j] = x[i][j] + r;
        }

[Figure: with tiling, accesses to y and z stay within b×b blocks, improving temporal locality.]

4.3. Reduce Miss Penalty

Multi-Level Caches

[Figure: Processor → L1 Cache → L2 Cache → Main Memory, with L1 hits served by L1 and L1 misses that hit in L2 served by L2.]

AMAL = Hit Latency of L1 + ( Miss Rate of L1 × AMAL_L2 )
AMAL_L2 = Hit Latency of L2 + ( Miss Rate of L2 × Miss Penalty of L2 )

 - Local miss rate = misses in this cache / accesses to this cache
 - Global miss rate = misses in this cache / processor memory accesses
 - Misses per instruction = misses in this cache / number of instructions

Use a smaller L1 if there is also an L2
 - Trade increased L1 miss rate for reduced L1 hit time and L1 miss penalty
 - Reduces average access energy

Use a simpler write-through L1 with an on-chip L2
 - The write-back L2 cache absorbs write traffic, so it doesn't go off-chip
 - Simplifies the processor pipeline
 - Simplifies on-chip coherence issues

Inclusive Multilevel Cache
 - The inner cache holds a copy of data in the outer cache
 - External coherence is simpler

Exclusive Multilevel Cache
 - The inner cache may hold data not in the outer cache
 - Swap lines between the inner/outer caches on a miss
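A minimal C sketch evaluating the two equations above with hypothetical latencies and local miss rates:

  #include <stdio.h>

  int main(void) {
    double l1_hit = 1.0,  l1_miss_rate = 0.10;   /* local miss rate */
    double l2_hit = 10.0, l2_miss_rate = 0.20;   /* local miss rate */
    double l2_miss_penalty = 100.0;

    double amal_l2 = l2_hit + l2_miss_rate * l2_miss_penalty;
    double amal    = l1_hit + l1_miss_rate * amal_l2;
    printf("AMAL_L2 = %.1f cycles, AMAL = %.1f cycles\n", amal_l2, amal);
    return 0;
  }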

Prioritize Reads

[Figure: processor writes go into a write buffer between the cache and main memory; read misses bypass the buffered writes on their way to main memory.]

 - The processor is not stalled on writes, and read misses can go ahead of writes to main memory
 - The write buffer may hold the updated value of a location needed by a read miss
 - On a read miss, either wait for the write buffer to be empty, or check the write buffer addresses and bypass the matching data

4.4. Cache Optimization Summary

  Technique                   Hit Lat   Miss Rate   Miss Penalty   BW   HW
  Smaller caches                 +                                      0
  Avoid TLB before indexing      +                                      1
  Large block size                          +                           0
  Large cache size                          +                           1
  High associativity                        +                           1
  Hardware prefetching                      +                           2
  Compiler optimizations                    +                           0
  Multi-level caches                                     +              2
  Prioritize reads                                       +              1
  Pipelining                                                        +

5. Case Study: ARM Cortex A8 and Intel Core i7

5.1. ARM Cortex A8

[Figure 2.16 from Hennessy & Patterson: the virtual address, physical address, indexes, tags, and data blocks for the ARM Cortex-A8 caches and TLB; the instruction and data hierarchies are symmetric, so only one is shown. The TLB (instruction or data) is fully associative with 32 entries. The L1 cache is four-way set-associative with 64-byte blocks and 32 KB capacity. The L2 cache is eight-way set-associative with 64-byte blocks and 1 MB capacity. The figure omits the valid and protection bits for the caches and TLB, as well as the way-prediction bits that would dictate the predicted bank of the L1 cache. It assumes a 16 KB page size.]

L1 cache
 - 32 KB, 64 B lines, 4-way set-associative with random replacement
 - 32K/64 = 512 lines, 128 sets (set index is 7 bits)
 - Virtually indexed, physically tagged, with single-cycle hit latency

Memory management unit
 - TLB with multi-level page tables in physical memory
 - The TLB has 32 entries, fully associative, with a hardware TLB miss handler
 - Variable page size: 4 KB, 16 KB, 64 KB, 1 MB, 16 MB (the figure shows 16 KB)
 - TLB entries can have wildcards to support multiple page sizes

L2 cache
 - 1 MB, 64 B lines, 8-way set-associative
 - 1M/64 = 16K lines, 2K sets (set index is 11 bits)
 - Physically addressed, with multi-cycle hit latency

5.2. Intel Core i7

[Figure 2.19 from Hennessy & Patterson: characteristics of the i7's TLB structure, which has separate first-level instruction and data TLBs, both backed by a joint second-level TLB. The first-level TLBs support the standard 4 KB page size as well as a limited number of entries for large 2 MB to 4 MB pages; only 4 KB pages are supported in the second-level TLB.]

  Characteristic    Instruction TLB   Data TLB     Second-level TLB
  Associativity     4-way             4-way        4-way
  Replacement       Pseudo-LRU        Pseudo-LRU   Pseudo-LRU
  Access latency    1 cycle           1 cycle      6 cycles
  Miss penalty      7 cycles          7 cycles     Hundreds of cycles to access the page table

[Figure 2.20 from Hennessy & Patterson: characteristics of the three-level cache hierarchy in the i7. All caches use write-back and a block size of 64 bytes. The L1 and L2 caches are separate for each core, while the L3 is shared among the cores on a chip and totals 2 MB per core. All three caches are nonblocking and allow multiple outstanding writes. A merging write buffer is used for the L1, which holds data in the event that the line is not present in L1 when it is written; that is, an L1 write miss does not cause the line to be allocated. L3 is inclusive of L1 and L2. Replacement is by a variant on pseudo-LRU; in the case of L3, the block replaced is always the lowest-numbered way whose access bit is turned off.]

  Characteristic       L1                    L2           L3
  Size                 32 KB I / 32 KB D     256 KB       2 MB per core
  Associativity        4-way I / 8-way D     8-way        16-way
  Access latency       4 cycles, pipelined   10 cycles    35 cycles
  Replacement scheme   Pseudo-LRU            Pseudo-LRU   Pseudo-LRU with an ordered selection algorithm

 - 48-bit virtual addresses and 36-bit physical addresses (2^36 B = 64 GB of physical memory)
 - Write-back, with a merging write buffer (more like no write allocate)
 - L3 is inclusive of L1/L2
 - Hardware prefetching from L2 into L1, and from L3 into L2
 - L1 caches are virtually indexed, physically tagged; L2/L3 caches are physically addressed
 - 4 KB pages, except for a few large 2–4 MB pages in the L1 TLBs
