ECE 4750 Computer Architecture, Fall 2014 T05 FSM and Pipelined Cache Memories


School of Electrical and Computer Engineering, Cornell University
revision: 2014-10-08-02-02

Contents
1. FSM Set-Associative Cache Memory
2. Impact of Associativity on Microarchitecture
3. Pipelined Cache Microarchitecture
4. Integrating Processors and Caches
5. Multi-Bank Cache Microarchitecture
6. Cache Optimizations
7. Cache Examples

1. FSM Set-Associative Cache Memory

[Figure: processor accesses the cache; hits are serviced by the cache, misses by main memory]

Average Memory Access Latency (AMAL):

  AMAL = ( Hit Time + ( Miss Rate × Miss Penalty ) ) × Cycle Time

  Microarchitecture                Hit Time (cycles)   Cycle Time
  Single-Cycle Cache (Topic 04)    1                   long
  FSM Cache (this topic)           >1                  short
  Pipelined Cache (this topic)     1                   short

Technology constraints:
- Cache hits and main memory accesses will now take multiple cycles
- Assume more realistic technology with only single-ported SRAMs for the tag and data arrays

[Block diagram: control unit plus datapath with tag array, data array, and status bits; 32b cache request/response interface (creq/cresp) to the processor; 128b memory request/response interface (mreq/mresp) to main memory, which now takes >1 cycle and is no longer combinational]

We also change the write policy from write-through with no allocate on write miss to write-back with allocate on write miss.
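As a quick worked example with illustrative (made-up) numbers: a 1-cycle hit time, a 10% miss rate, a 10-cycle miss penalty, and a 1 ns cycle time give

  AMAL = ( 1 + ( 0.10 × 10 ) ) × 1 ns = 2 ns per access

so even a modest miss rate doubles the average latency here.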

Steps involved in executing transactions:

READ transaction:
1. Check tag array
2. Hit: read word from data array; return word
3. Miss: choose victim line
   - If the victim is dirty: read victim from data array; write victim to main memory
4. Read line from main memory; write line to data array
5. Read word from data array; return word

WRITE transaction:
1. Check tag array
2. Hit: write word to data array
3. Miss: choose victim line
   - If the victim is dirty: read victim from data array; write victim to main memory
4. Read line from main memory; write line to data array
5. Write word to data array

FSM Control Unit

[FSM state-transition diagram developed in lecture]
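As a rough illustration of the control flow above, here is a minimal C sketch of the next-state function such an FSM control unit might implement (state names and inputs are hypothetical, not the course's reference design):

  // States for a blocking FSM cache with a write-back,
  // allocate-on-miss policy (one state per step above).
  typedef enum {
    IDLE,              // wait for a cache request
    TAG_CHECK,         // read tag array and compare
    WRITEBACK_VICTIM,  // dirty victim: write its line to main memory
    REFILL,            // read the missing line from main memory
    DATA_ACCESS,       // read or write the requested word
    RESPOND            // return the word (read) or an ack (write)
  } state_t;

  state_t next_state(state_t s, int hit, int victim_dirty, int mem_done) {
    switch (s) {
      case IDLE:             return TAG_CHECK;
      case TAG_CHECK:        return hit ? DATA_ACCESS
                                        : (victim_dirty ? WRITEBACK_VICTIM
                                                        : REFILL);
      case WRITEBACK_VICTIM: return mem_done ? REFILL : WRITEBACK_VICTIM;
      case REFILL:           return mem_done ? DATA_ACCESS : REFILL;
      case DATA_ACCESS:      return RESPOND;
      default:               return IDLE;  // RESPOND -> IDLE
    }
  }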

2. Impact of Associativity on Microarchitecture

Direct-Mapped Cache Microarchitecture

[Datapath diagram: the address splits into a 26b tag, 2b index, 2b word offset, and 2b "00" byte offset; the index selects one of four sets; the stored valid bit and tag are compared against the address tag to produce hit?; the word-enable mux selects rdata from the indexed line. Tag check and data access are shown as the two steps.]

Set-Associative Cache Microarchitecture (two-way)

[Datapath diagram: the address splits into a 27b tag, 1b index, and 2b word offset; the index selects one of two sets in each of two ways; two tag comparators produce a 1b way-select and hit?; the selected way's data array supplies rdata.]
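To make the field widths concrete, here is a small C sketch of address decomposition for the direct-mapped organization above (32-bit addresses; four sets of 16-byte lines, so 26b tag, 2b index, 2b word offset, 2b byte offset; the helper names are mine):

  #include <stdint.h>

  // Direct-mapped, 4 sets, 16B (4-word) lines.
  static inline uint32_t byte_off(uint32_t a) { return a & 0x3;        } // 2b "00"
  static inline uint32_t word_off(uint32_t a) { return (a >> 2) & 0x3; } // 2b off
  static inline uint32_t set_idx (uint32_t a) { return (a >> 4) & 0x3; } // 2b idx
  static inline uint32_t addr_tag(uint32_t a) { return a >> 6;         } // 26b tag

For the two-way set-associative version one index bit moves into the tag (27b tag, 1b index); for the fully associative version both do (28b tag, no index).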

Fully Associative Cache Microarchitecture

[Datapath diagram: the address splits into a 28b tag and 2b word offset, with no index; four tag comparators (one per way) produce a 2b way-select and hit?; the selected way's data array supplies rdata.]

3. Pipelined Cache Microarchitecture

Given a sequence of three memory transactions (read, write, read), draw a transaction vs. time diagram illustrating how these transactions execute on three different microarchitectures. Only consider hits, and assume each microarchitecture must perform two steps: MT = cache memory tag check, and MD = cache memory data access. Use Y to indicate a cycle spent waiting before cache access.

- Single-cycle cache memory (1-cycle hit latency): we do both MT and MD in a single cycle
- FSM cache memory (2-cycle hit latency): we do MT in the first state and MD in the second state
- Pipelined cache memory (2-cycle hit latency): we do MT in the first stage and MD in the second stage
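One way the requested diagram might look (hits only; this assumes the FSM cache accepts a new transaction only after the previous one completes):

              cycle:  1      2      3      4     5    6
  Single-cycle
    read             MT+MD
    write                   MT+MD
    read                           MT+MD
  FSM
    read             MT     MD
    write            Y      Y      MT     MD
    read             Y      Y      Y      Y     MT   MD
  Pipelined
    read             MT     MD
    write            Y      MT     MD
    read             Y      Y      MT     MD

The pipelined cache has the same 2-cycle hit latency as the FSM cache but sustains one transaction per cycle.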

Direct-Mapped Parallel Read Hit Path

[Datapath diagram: the tag check and the data-array read proceed in parallel in the same cycle; the hit? signal from the tag compare qualifies the read data at the end of the cycle.]

Direct-Mapped Pipelined Write Hit Path

[Datapath diagram: the tag check happens in the first cycle; on a hit, the write enable and wdata are registered and the data-array write happens in the following cycle.]

Set-Associative Parallel Read Hit Path

[Datapath diagram: both ways' tag arrays and data arrays are read in parallel; the 1b way-select from the tag compare chooses between the two ways' read data.]

[Pipeline diagrams for parallel-read, pipelined-write transactions drawn in lecture]

Resolving the structural hazard with software scheduling:
- Insert a nop between a write and a following read so they never contend for the data array

Resolving the structural hazard with hardware stalling:

  stall_y = ( type_y == RD ) && ( type_m0 == WR )

Resolving the structural hazard with hardware duplication:
- Duplicate (or dual-port) the data array so a read and a write can proceed in the same cycle

Resolving the RAW data hazard with software scheduling:
- The hazard depends on the memory address, so it is difficult to know at compile time!

Resolving the RAW data hazard with hardware stalling:

  stall_y = ( type_y == RD ) && ( type_m0 == WR ) && ( addr_y == addr_m0 )

Resolving the RAW data hazard with hardware bypassing:

We could use the previous stall signal as our bypass signal, but we will also need a new bypass path in our datapath. Draw this new bypass path on the following datapath diagram.

[Datapath diagram: the direct-mapped pipelined-write hit path from above, on which the bypass path from the buffered write data to the read-data output is to be drawn.]
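A minimal sketch of the resulting bypass logic (the condition reuses the stall equation above with the slides' signal names; the mux expression is my gloss):

  bypass_y    = ( type_y == RD ) && ( type_m0 == WR ) && ( addr_y == addr_m0 );
  read_data_y = bypass_y ? wdata_m0 : data_array_rdata;  // new bypass mux

With the bypass path in place, a read that matches the in-flight write no longer needs to stall (for hits).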

4. Integrating Processors and Caches

Zero-Cycle Hit Latency with Tightly Coupled Interface

[Datapath diagram: a five-stage pipelined processor (Fetch F, Decode & Register Read D, Execute X, Memory M, Writeback W) with the instruction memory embedded in F and the data memory embedded in M; the cache's tag check and data access both fit within the stage, so a hit adds no extra cycles. Bypass paths come from X, M, and W.]

Two-Cycle Hit Latency with Val/Rdy Interface

[Datapath diagram: the processor talks to the instruction and data caches over val/rdy memreq/memresp interfaces; fetch occupies two stages (F0/F1) and memory access occupies two stages (M0/M1), giving the pipeline F0/F1, D, X, M0/M1, W. The pipeline stalls when a cache is not ready (!rdy); bypass paths come from X, M0/M1, and W.]

Parallel Read, Pipelined Write Hit Path

[Datapath diagram: the pipelined processor with a data cache that performs the tag check and read access in parallel in M, but performs the write access in the following cycle. New hazards?]

Estimating execution time: how long in cycles will it take to execute the vvadd example assuming n is 64? Assume the cache is initially empty, parallel-read/pipelined-write, four-way set-associative, and the miss penalty is two cycles.

  loop:
    lw    r12, 0(r4)
    lw    r13, 0(r5)
    addu  r14, r12, r13
    sw    r14, 0(r6)
    addiu r4, r4, 4
    addiu r5, r5, 4
    addiu r6, r6, 4
    addiu r7, r7, -1
    bne   r7, r0, loop
  jr r31

4 Integrating Processors and Caches lw lw + sw +i +i +i +i bne opa opb lw lw + sw +i +i +i +i bne opa opb lw 14

5. Multi-Bank Cache Microarchitecture

[Multi-bank cache diagrams developed in lecture]
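As a rough illustration of the basic idea (a sketch under my own assumptions, not the course's design): one common arrangement interleaves cache lines across banks using low-order line-address bits, here for four banks and the 16B lines used earlier:

  #include <stdint.h>

  // 4 banks, 16B lines: bits [3:0] are the line offset,
  // bits [5:4] pick the bank, and higher bits index within a bank.
  static inline uint32_t bank_sel(uint32_t addr) { return (addr >> 4) & 0x3; }

Interleaving by line lets consecutive lines live in different banks, so multiple requesters can access different banks in the same cycle.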


6. Cache Optimizations

[Figure: processor, cache, main memory; hits serviced by the cache, misses by main memory]

  AMAL = ( Hit Time + ( Miss Rate × Miss Penalty ) ) × Cycle Time

Reduce hit time:
- Small and simple caches

Reduce miss penalty:
- Multi-level cache hierarchy
- Prioritize reads

Reduce miss rate:
- Large block size
- Large cache size
- High associativity
- Compiler optimizations

Reduce Cycle Time: Small & Simple Caches

Reduce Miss Rate: Large Block Size
- Less tag overhead
- Exploits fast burst transfers from DRAM and over wide on-chip busses
- Can waste bandwidth if the data is not used
- Fewer blocks → more conflict misses

Reduce Miss Rate: Large Cache Size or High Associativity
- Empirically, if the cache size is doubled, the miss rate usually drops by about √2
- A direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2

Reduce Miss Rate: Compiler Optimizations

Restructuring code affects the data block access sequence:
- Group data accesses together to improve spatial locality
- Re-order data accesses to improve temporal locality

Prevent data from entering the cache:
- Useful for variables that will only be accessed once before eviction
- Needs a mechanism for software to tell hardware not to cache data (e.g., no-allocate instruction hints or page table bits)

Kill data that will never be used again:
- Streaming data exploits spatial locality but not temporal locality
- Replace into dead-cache locations

Loop Interchange and Fusion

What type of locality does each optimization improve? (See the note after the code.)

Loop interchange, before:

  for(j=0; j < N; j++) {
    for(i=0; i < M; i++) {
      x[i][j] = 2 * x[i][j];
    }
  }

Loop interchange, after:

  for(i=0; i < M; i++) {
    for(j=0; j < N; j++) {
      x[i][j] = 2 * x[i][j];
    }
  }

Loop fusion, before:

  for(i=0; i < N; i++)
    a[i] = b[i] * c[i];
  for(i=0; i < N; i++)
    d[i] = a[i] * c[i];

Loop fusion, after:

  for(i=0; i < N; i++) {
    a[i] = b[i] * c[i];
    d[i] = a[i] * c[i];
  }
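One way to answer, for row-major C arrays: interchange makes the inner loop traverse consecutive memory locations, improving spatial locality; fusion reuses a[i] and c[i] while they are still cached, improving temporal locality.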

Matrix Multiply with Naive Code

  for(i=0; i < N; i++)
    for(j=0; j < N; j++) {
      r = 0;
      for(k=0; k < N; k++)
        r = r + y[i][k] * z[k][j];
      x[i][j] = r;
    }

[Figure: access patterns for x, y, and z, marking not-touched, old, and new accesses; the inner loop walks a row of y but a column of z]

Matrix Multiply with Cache Tiling

  for(jj=0; jj < N; jj=jj+b)
    for(kk=0; kk < N; kk=kk+b)
      for(i=0; i < N; i++)
        for(j=jj; j < min(jj+b,N); j++) {
          r = 0;
          for(k=kk; k < min(kk+b,N); k++)
            r = r + y[i][k] * z[k][j];
          x[i][j] = x[i][j] + r;
        }

[Figure: access patterns for x, y, and z under tiling]
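One way to think about choosing the tile size b (my gloss, not from the slides): pick b so that roughly a b×b block of z, plus the b-element strips of y and x it is multiplied against, fits in the cache at once; the resident block of z is then reused across all N values of i instead of being refetched from memory on every pass.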

Reduce Miss Penalty: Multi-Level Caches

[Figure: processor → L1 cache → L2 cache → main memory; L1 hits serviced by L1, L1 misses that hit in L2 serviced by L2]

  AMAL_L1 = Hit Time of L1 + ( Miss Rate of L1 × AMAL_L2 )
  AMAL_L2 = Hit Time of L2 + ( Miss Rate of L2 × Miss Penalty of L2 )

- Local miss rate = misses in this cache / accesses to this cache
- Global miss rate = misses in this cache / processor memory accesses
- Misses per instruction = misses in this cache / number of instructions

Use a smaller L1 if there is also an L2:
- Trade an increased L1 miss rate for a reduced L1 hit time and L1 miss penalty
- Reduces average access energy

Use a simpler write-through L1 with an on-chip L2:
- The write-back L2 cache absorbs write traffic, so it doesn't go off-chip
- Simplifies the processor pipeline
- Simplifies on-chip coherence issues

Inclusive multilevel cache:
- The inner cache holds a copy of data that is also in the outer cache
- External coherence is simpler

Exclusive multilevel cache:
- The inner cache may hold data that is not in the outer cache
- Swap lines between the inner and outer cache on a miss
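A quick worked example with made-up numbers: suppose the L1 hit time is 1 cycle with a 10% (local) miss rate, and the L2 hit time is 10 cycles with a 20% local miss rate and a 100-cycle miss penalty. Then AMAL_L2 = 10 + ( 0.2 × 100 ) = 30 cycles, and AMAL_L1 = 1 + ( 0.1 × 30 ) = 4 cycles, versus 1 + ( 0.1 × 100 ) = 11 cycles with no L2.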

Reduce Miss Penalty: Prioritize Reads

[Figure: processor → cache → write buffer → main memory; read misses go to main memory alongside buffered writes]

- The processor is not stalled on writes, and read misses can go ahead of buffered writes to main memory
- Problem: the write buffer may hold the updated value of a location needed by a read miss
  - Simple solution: on a read miss, wait for the write buffer to empty
  - Faster solution: check the write buffer addresses and bypass the matching value

Cache Optimizations: Impact on Average Memory Access Latency

  Technique                Hit Time   Miss Rate   Miss Penalty   HW Cost
  Parallel read hit        +                                     0
  Pipelined write hit      +                                     1
  Smaller caches           ++                                    0
  Large block size                    +                          0
  Large cache size                    ++                         1
  High associativity                  ++                         1
  Compiler optimizations              +                          0
  Multi-level cache                               +              2
  Prioritize reads                                +              1
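A minimal C sketch of the faster solution, assuming a small fully associative write buffer searched by address on a read miss (the types and names here are hypothetical):

  #include <stdint.h>

  typedef struct { uint32_t addr; uint32_t data; int valid; } wb_entry_t;

  // On a read miss, search the write buffer before going to main memory.
  int wb_bypass(const wb_entry_t *wb, int n, uint32_t addr, uint32_t *data) {
    for (int i = 0; i < n; i++)
      if (wb[i].valid && wb[i].addr == addr) {
        *data = wb[i].data;  // forward the buffered store value
        return 1;            // (assumes at most one match per address)
      }
    return 0;                // no match: the read miss may pass the writes
  }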

7. Cache Examples

Intel Itanium-2 On-Chip Caches (Intel/HP, 2002)
- L1: 16KB, 4-way set-associative, 64B lines, quad-port (2 load + 2 store), single-cycle latency
- L2: 256KB, 4-way set-associative, 128B lines, quad-port (4 loads or 4 stores), five-cycle latency
- L3: 3MB, 12-way set-associative, 128B lines, single 32B port, twelve-cycle latency

IBM Power-7 On-Chip Caches (IBM, 2009)
- 32KB L1 instruction cache and 32KB L1 data cache per core, 3-cycle latency
- 256KB unified L2 cache per core, 8-cycle latency
- 32MB unified shared L3 cache in embedded DRAM, 25-cycle latency to the local slice