ECE 4750 Computer Architecture, Fall 2014
T05 FSM and Pipelined Cache Memories
School of Electrical and Computer Engineering, Cornell University
revision: 2014-10-08-02-02

1 FSM Set-Associative Cache Memory .............. 2
2 Impact of Associativity on Microarchitecture .. 5
3 Pipelined Cache Microarchitecture ............. 7
4 Integrating Processors and Caches ............ 12
5 Multi-Bank Cache Microarchitecture ........... 15
6 Cache Optimizations .......................... 18
7 Cache Examples ............................... 24
1. FSM Set-Associative Cache Memory

[Figure: processor accesses the cache; hits are serviced by the cache, misses go to main memory]

Average Memory Access Latency (AMAL):

    AMAL = ( Hit Time + ( Miss Rate × Miss Penalty ) ) × Cycle Time

    Topic       Microarchitecture    Hit Time   Cycle Time
    Topic 04    Single-Cycle Cache   1          long
    this topic  FSM Cache            >1         short
    this topic  Pipelined Cache      1          short

Technology Constraints
 - Cache hit and main memory access will now take multiple cycles
 - Assume more realistic technology with only single-ported SRAMs for the tag and data arrays

[Figure: FSM cache organization: control unit plus datapath containing the tag array, data array, and status bits; 32b cache request/response (creq/cresp) interface to the processor, 128b memory request/response (mreq/mresp) interface to main memory; main memory access takes >1 cycle, cache datapath is combinational]

 - Change write policy from write-through, no allocate on write miss, to write-back, allocate on write miss
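As a quick check of the formula, a minimal sketch in C (the numbers used below are hypothetical, chosen only to exercise the arithmetic):

```c
#include <assert.h>
#include <math.h>

/* AMAL = ( hit_time + miss_rate * miss_penalty ) * cycle_time */
static double amal(double hit_time, double miss_rate,
                   double miss_penalty, double cycle_time) {
    return (hit_time + miss_rate * miss_penalty) * cycle_time;
}
```

For example, with an assumed 1-cycle hit time, 5% miss rate, 10-cycle miss penalty, and unit cycle time, amal(1.0, 0.05, 10.0, 1.0) evaluates to (1 + 0.5) × 1 = 1.5.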
Steps involved in executing transactions

READ transactions:
 1. Check tag array
 2. Hit: read word from data array, return word
 3. Miss: choose victim line
 4. Dirty victim: read victim from data array, write victim to main memory
 5. Read line from main memory
 6. Write line to data array
 7. Read word from data array
 8. Return word

WRITE transactions:
 1. Check tag array
 2. Hit: write word to data array
 3. Miss: choose victim line
 4. Dirty victim: read victim from data array, write victim to main memory
 5. Read line from main memory
 6. Write line to data array
 7. Write word to data array
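The transaction steps above can be sketched in software. Below is a toy two-way set-associative, write-back, allocate-on-miss cache; all sizes, the one-word lines, and the trivial round-robin victim policy are assumptions for illustration, not the course's actual design:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy 2-way set-associative, write-back, allocate-on-miss cache:
 * 4 sets, one-word (4B) lines, over a 256-word main memory. */
enum { WAYS = 2, SETS = 4, MEM_WORDS = 256 };

typedef struct {
    bool     valid, dirty;
    uint32_t tag;
    uint32_t data;              /* one-word line */
} Line;

static Line     cache[SETS][WAYS];
static uint32_t mem[MEM_WORDS];
static int      next_victim[SETS];   /* round-robin victim choice */

static uint32_t cache_access(uint32_t word_addr, bool is_write,
                             uint32_t wdata) {
    uint32_t idx = word_addr % SETS;
    uint32_t tag = word_addr / SETS;

    /* Step 1: check tag array */
    for (int w = 0; w < WAYS; w++) {
        Line *l = &cache[idx][w];
        if (l->valid && l->tag == tag) {           /* hit */
            if (is_write) { l->data = wdata; l->dirty = true; }
            return l->data;
        }
    }

    /* Miss: choose victim line */
    int v = next_victim[idx];
    next_victim[idx] = (v + 1) % WAYS;
    Line *l = &cache[idx][v];

    /* Dirty victim: read from data array, write back to main memory */
    if (l->valid && l->dirty)
        mem[l->tag * SETS + idx] = l->data;

    /* Read line from main memory, write it into the data array */
    l->valid = true;
    l->dirty = false;
    l->tag   = tag;
    l->data  = mem[word_addr];

    /* Finally read or write the requested word */
    if (is_write) { l->data = wdata; l->dirty = true; }
    return l->data;
}
```

Writing a word, evicting its set twice, then reading the word back exercises the dirty-victim write-back path.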
FSM Control Unit
2. Impact of Associativity on Microarchitecture

Direct-Mapped Cache Microarchitecture

[Figure: address split into 26b tag, 2b index, 2b offset (plus 2b byte offset); tag array of four sets, each a valid bit plus tag, with one comparator producing hit?; data array of four sets with word-enable muxing and rd/wr, wdata, rdata ports; tag check followed by data access]

Set-Associative Cache Microarchitecture

[Figure: address split into 27b tag, 1b index, 2b offset; two ways of two sets each; two tag comparators producing a 1b way select and hit?; two-way data array with word-enable muxing; tag check followed by data access]
Fully Associative Cache Microarchitecture

[Figure: address split into 28b tag, 2b offset; four ways, each a single set; four tag comparators producing a 2b way select and hit?; four-way data array with word-enable muxing; tag check followed by data access]
3. Pipelined Cache Microarchitecture

Given a sequence of three memory transactions (READ, WRITE, READ), draw a transaction vs. time diagram illustrating how these transactions execute on three different microarchitectures. Only consider hits, and assume each microarchitecture must perform two steps: MT = cache memory tag check, and MD = cache memory data access. Use Y to indicate the cycle before cache access.

 - Single-cycle cache memory (1-cycle hit latency): we do both MT and MD in a single cycle
 - FSM cache memory (2-cycle hit latency): we do MT in the first state and MD in the second state
 - Pipelined cache memory (2-cycle hit latency): we do MT in the first stage and MD in the second stage
Direct-Mapped Parallel Read Hit Path

[Figure: direct-mapped datapath in which a read performs the tag check and the data-array read access in parallel in the same cycle]

Direct-Mapped Pipelined Write Hit Path

[Figure: direct-mapped datapath with a write-enable path; a write performs the tag check in the first stage and the data-array write access in the second stage]
Set-Associative Parallel Read Hit Path

[Figure: two-way set-associative datapath; the tag check in both ways proceeds in parallel with reading both ways of the data array, and the 1b way select chooses the read data]

Pipeline diagrams for the parallel-read, pipelined-write hit paths
Resolving the structural hazard with software scheduling: insert a nop between the write and the following read

Resolving the structural hazard with hardware stalling:

    stall_y = ( type_y == RD ) && ( type_m0 == WR )

Resolving the structural hazard with hardware duplication

Resolving the RAW data hazard with software scheduling: the hazard depends on the memory address, so it is difficult to know at compile time!

Resolving the RAW data hazard with hardware stalling:

    stall_y = ( type_y == RD ) && ( type_m0 == WR ) && ( addr_y == addr_m0 )
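The two stall conditions translate directly into code; a minimal sketch (the enum and the way the per-stage signals are passed around are assumptions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { RD, WR } XType;

/* Structural hazard: a read in Y conflicts with a write in M0
 * because both need the single data-array port in the same cycle. */
static bool stall_structural(XType type_y, XType type_m0) {
    return (type_y == RD) && (type_m0 == WR);
}

/* RAW data hazard: only stall when the addresses actually match. */
static bool stall_raw(XType type_y, XType type_m0,
                      uint32_t addr_y, uint32_t addr_m0) {
    return (type_y == RD) && (type_m0 == WR) && (addr_y == addr_m0);
}
```

Note that the RAW stall is strictly less frequent than the structural stall: it adds the address comparison as an extra conjunct.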
Resolving the RAW data hazard with hardware bypassing

We could use the previous stall signal as our bypass signal, but we will also need a new bypass path in our datapath. Draw this new bypass path on the following datapath diagram.

[Figure: direct-mapped datapath with the pipelined-write hit path (tag check stage, then read/write access stage), on which to draw the bypass path]
4. Integrating Processors and Caches

Zero-Cycle Hit Latency with Tightly Coupled Interface

[Figure: five-stage pipelined processor (F, D, X, M, W) with the instruction memory's tag check and data access folded into F and the data memory's tag check and data access folded into M, i.e., single-cycle tightly coupled memories]

Two-Cycle Hit Latency with Val/Rdy Interface

[Figure: pipelined processor with split fetch (F0/F1) and memory (M0/M1) stages; the instruction and data caches attach through memreq/memresp val/rdy interfaces, !rdy stalls the requesting stage, and the tag check and data access occupy separate stages]
Parallel Read, Pipelined Write Hit Path

[Figure: pipelined processor in which data-cache reads perform the tag check and data access in parallel in M, while writes perform the tag check in M and the data-array write in the following cycle]

New Hazards?

Estimating execution time: how long in cycles will it take to execute the vvadd example assuming n is 64? Assume the cache is initially empty, parallel-read/pipelined-write, four-way set-associative, and the miss penalty is two cycles.

    loop:
      lw    r12, 0(r4)
      lw    r13, 0(r5)
      addu  r14, r12, r13
      sw    r14, 0(r6)
      addiu r4, r4, 4
      addiu r5, r5, 4
      addiu r6, r6, 4
      addiu r7, r7, -1
      bne   r7, r0, loop
      jr    r31
4 Integrating Processors and Caches lw lw + sw +i +i +i +i bne opa opb lw lw + sw +i +i +i +i bne opa opb lw 14
5. Multi-Bank Cache Microarchitecture
6. Cache Optimizations

[Figure: processor, cache, main memory]

    AMAL = ( Hit Time + ( Miss Rate × Miss Penalty ) ) × Cycle Time

Reduce hit time:
 - Small and simple caches

Reduce miss penalty:
 - Multi-level cache hierarchy
 - Prioritize reads

Reduce miss rate:
 - Large block size
 - Large cache size
 - High associativity
 - Compiler optimizations

Reduce Cycle Time: Small & Simple Caches
Reduce Miss Rate: Large Block Size
 - Less tag overhead
 - Exploit fast burst transfers from DRAM and over wide on-chip busses
 - Can waste bandwidth if data is not used
 - Fewer blocks → more conflicts

Reduce Miss Rate: Large Cache Size or High Associativity
 - If cache size is doubled, miss rate usually drops by about √2
 - A direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2
Reduce Miss Rate: Compiler Optimizations
 - Restructuring code affects the data block access sequence
    - Group data accesses together to improve spatial locality
    - Re-order data accesses to improve temporal locality
 - Prevent data from entering the cache
    - Useful for variables that will only be accessed once before eviction
    - Needs a mechanism for software to tell hardware not to cache data (no-allocate instruction hints or page table bits)
 - Kill data that will never be used again
    - Streaming data exploits spatial locality but not temporal locality
    - Replace into dead-cache locations

Loop Interchange and Fusion

What type of locality does each optimization improve?

Loop interchange:

    /* before */
    for (j = 0; j < N; j++)
      for (i = 0; i < M; i++)
        x[i][j] = 2 * x[i][j];

    /* after */
    for (i = 0; i < M; i++)
      for (j = 0; j < N; j++)
        x[i][j] = 2 * x[i][j];

What type of locality does this improve?

Loop fusion:

    /* before */
    for (i = 0; i < N; i++)
      a[i] = b[i] * c[i];
    for (i = 0; i < N; i++)
      d[i] = a[i] * c[i];

    /* after */
    for (i = 0; i < N; i++) {
      a[i] = b[i] * c[i];
      d[i] = a[i] * c[i];
    }

What type of locality does this improve?
Matrix Multiply with Naive Code

    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++)
          r = r + y[i][k] * z[k][j];
        x[i][j] = r;
      }

[Figure: access patterns for x, y, and z, marking not-touched, old-access, and new-access elements]

Matrix Multiply with Cache Tiling

    for (jj = 0; jj < N; jj = jj + b)
      for (kk = 0; kk < N; kk = kk + b)
        for (i = 0; i < N; i++)
          for (j = jj; j < min(jj + b, N); j++) {
            r = 0;
            for (k = kk; k < min(kk + b, N); k++)
              r = r + y[i][k] * z[k][j];
            x[i][j] = x[i][j] + r;
          }

[Figure: tiled access patterns for x, y, and z]
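The tiled loop nest must produce the same result as the naive version; a runnable sketch that checks this (the matrix size N and tile size B are assumed values, with B playing the role of the slides' b):

```c
#include <assert.h>
#include <string.h>

enum { N = 8, B = 4 };   /* assumed: 8x8 matrices, 4x4 tiles */

/* Naive i-j-k matrix multiply: x = y * z */
static void mm_naive(int x[N][N], int y[N][N], int z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int r = 0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

static int min(int a, int b) { return a < b ? a : b; }

/* Cache-tiled version: same arithmetic, but z is walked in BxB tiles
 * so each tile can stay resident in the cache while it is reused.
 * x must be cleared first because each (jj,kk) tile pair only adds
 * a partial sum into x[i][j]. */
static void mm_tiled(int x[N][N], int y[N][N], int z[N][N]) {
    memset(x, 0, sizeof(int) * N * N);
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min(jj + B, N); j++) {
                    int r = 0;
                    for (int k = kk; k < min(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```

Comparing the two outputs element-for-element confirms that tiling changes only the access order, not the result.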
Reduce Miss Penalty: Multi-Level Caches

[Figure: processor → L1 cache → L2 cache → main memory; L1 hits are serviced by L1, L1 misses that hit in L2 are serviced by L2]

    AMAL_L1 = Hit Time of L1 + ( Miss Rate of L1 × AMAL_L2 )
    AMAL_L2 = Hit Time of L2 + ( Miss Rate of L2 × Miss Penalty of L2 )

 - Local miss rate = misses in this cache / accesses to this cache
 - Global miss rate = misses in this cache / processor memory accesses
 - Misses per instruction = misses in this cache / number of instructions

Use a smaller L1 if there is also an L2
 - Trade increased L1 miss rate for reduced L1 hit time and L1 miss penalty
 - Reduces average access energy

Use a simpler write-through L1 with an on-chip L2
 - Write-back L2 cache absorbs write traffic, doesn't go off-chip
 - Simplifies processor pipeline
 - Simplifies on-chip coherence issues

Inclusive multilevel cache
 - Inner cache holds a copy of data in the outer cache
 - External coherence is simpler

Exclusive multilevel cache
 - Inner cache may hold data not in the outer cache
 - Swap lines between inner/outer cache on a miss
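Plugging numbers into the two-level formulas makes the recursion concrete; all rates and latencies below are hypothetical, and the miss rates are local miss rates:

```c
#include <assert.h>
#include <math.h>

/* AMAL_L2 = HitTime_L2 + MissRate_L2 * MissPenalty_L2 */
static double amal_l2(double hit_time_l2, double local_miss_rate_l2,
                      double miss_penalty_l2) {
    return hit_time_l2 + local_miss_rate_l2 * miss_penalty_l2;
}

/* AMAL_L1 = HitTime_L1 + MissRate_L1 * AMAL_L2 */
static double amal_l1(double hit_time_l1, double local_miss_rate_l1,
                      double hit_time_l2, double local_miss_rate_l2,
                      double miss_penalty_l2) {
    return hit_time_l1
         + local_miss_rate_l1 * amal_l2(hit_time_l2, local_miss_rate_l2,
                                        miss_penalty_l2);
}
```

With an assumed 1-cycle L1 hit, 10% L1 local miss rate, 10-cycle L2 hit, 20% L2 local miss rate, and 100-cycle L2 miss penalty: AMAL_L2 = 10 + 0.2 × 100 = 30 cycles, so AMAL_L1 = 1 + 0.1 × 30 = 4 cycles.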
Reduce Miss Penalty: Prioritize Reads

[Figure: processor with a write buffer between the cache and main memory; read misses can go around buffered writes]

 - Processor not stalled on writes, and read misses can go ahead of writes to main memory
 - Write buffer may hold the updated value of a location needed by a read miss
    - On a miss, wait for the write buffer to be empty, or
    - Check write buffer addresses and bypass

Cache Optimizations Impact on Average Memory Access Latency

    Technique                Hit Time   Miss Rate   Miss Penalty   HW
    Parallel hit             +                                     0
    Pipelined write hit      +                                     1
    Smaller caches           ++                                    0
    Large block size                    +                          0
    Large cache size                    ++                         1
    High associativity                  ++                         1
    Compiler optimizations              +                          0
    Multi-level cache                               +              2
    Prioritize reads                                +              1
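The "check write buffer addresses and bypass" option above can be sketched as follows (the buffer size, entry layout, and newest-first search are assumptions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Before a read miss goes to main memory, compare its address against
 * every valid write-buffer entry; on a match, bypass the buffered value
 * instead of reading a stale value from memory. */
enum { WBUF_ENTRIES = 4 };

typedef struct { bool valid; uint32_t addr, data; } WBufEntry;

static WBufEntry wbuf[WBUF_ENTRIES];

/* Returns true and sets *data if the miss address hits in the buffer.
 * Entries are assumed to be filled in program order, so we search
 * newest-first to get the most recent buffered write. */
static bool wbuf_bypass(uint32_t miss_addr, uint32_t *data) {
    for (int i = WBUF_ENTRIES - 1; i >= 0; i--)
        if (wbuf[i].valid && wbuf[i].addr == miss_addr) {
            *data = wbuf[i].data;
            return true;
        }
    return false;
}
```

The alternative policy, waiting for the buffer to drain, needs no comparators but stalls every read miss behind all buffered writes.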
7. Cache Examples

Intel Itanium-2 On-Chip Caches (Intel/HP, 2002)
 - Level 1: 16KB, 4-way set-associative, 64B line, quad-port (2 load + 2 store), single-cycle latency
 - Level 2: 256KB, 4-way set-associative, 128B line, quad-port (4 load or 4 store), five-cycle latency
 - Level 3: 3MB, 12-way set-associative, 128B line, single 32B port, twelve-cycle latency
(slide: CS152, Spring 2010)

IBM Power-7 On-Chip Caches (IBM, 2009)
 - 32KB L1 I$/core and 32KB L1 D$/core, 3-cycle latency
 - 256KB unified L2$/core, 8-cycle latency
 - 32MB unified shared L3$ in embedded DRAM, 25-cycle latency to the local slice