CS 61C: Great Ideas in Computer Architecture. Lecture 13: Pipelining. Krste Asanović & Randy Katz


CS 61C: Great Ideas in Computer Architecture Lecture 13: Pipelining Krste Asanović & Randy Katz http://inst.eecs.berkeley.edu/~cs61c/fa17

Agenda
- RISC-V Pipeline
- Pipeline Control
- Hazards: Structural, Data (R-type instructions, Load), Control
- Superscalar processors
CS 61c Lecture 13: Pipelining 2

Recap: Pipelining with RISC-V
instruction sequence: add t0, t1, t2; or t3, t4, t5; sll t6, t0, t3

                      Single Cycle              Pipelining
t_step (per stage)    100-200 ps                200 ps (all cycles same length;
                      (register access          register access still 100 ps)
                      only 100 ps)
t_instruction         t_cycle = 800 ps          1000 ps
Clock rate f          1/800 ps = 1.25 GHz       1/200 ps = 5 GHz
Relative speed        1x                        4x
CS 61c 3
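The timing comparison above can be checked with a short calculation. This is a minimal sketch, assuming the slide's stage latencies (100 ps for register access and writeback, 200 ps for the other steps); the stage list is an illustrative model, not the course's datapath specification:

```python
# Toy model: single-cycle vs 5-stage pipelined timing for the RISC-V datapath.
# Stage latencies (ps) for F, D, X, M, W are assumed from the slide's numbers.
STAGE_PS = [200, 100, 200, 200, 100]

def single_cycle():
    """One instruction per clock; the cycle is the sum of all stage latencies."""
    t_cycle = sum(STAGE_PS)                 # 800 ps
    return t_cycle, 1e12 / t_cycle / 1e9    # cycle time (ps), clock rate (GHz)

def pipelined():
    """Cycle time set by the slowest stage; latency is 5 cycles end to end."""
    t_cycle = max(STAGE_PS)                 # 200 ps, all cycles same length
    latency = t_cycle * len(STAGE_PS)       # 1000 ps from fetch to writeback
    return t_cycle, latency, 1e12 / t_cycle / 1e9

sc_cycle, sc_ghz = single_cycle()
pl_cycle, pl_latency, pl_ghz = pipelined()
print(sc_cycle, sc_ghz)               # 800 ps, 1.25 GHz
print(pl_cycle, pl_latency, pl_ghz)   # 200 ps, 1000 ps, 5.0 GHz
print(sc_cycle / pl_cycle)            # 4.0x throughput speedup
```

Note that pipelining raises the latency of a single instruction (1000 ps vs 800 ps) while quadrupling throughput, which is exactly the trade the table shows.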

RISC-V Pipeline
instruction sequence: add t0, t1, t2; or t3, t4, t5; slt t6, t0, t3; sw t0, 4(t3); lw t0, 8(t3); addi t2, t2, 1
t_cycle = 200 ps, t_instruction = 1000 ps
Read the pipeline diagram two ways: down a column, resource use in a particular time slot; across a row, resource use of one instruction over time.
CS 61c Lecture 13: Pipelining 4

Single-Cycle RISC-V RV32I Datapath
[Figure: PC, +4 adder, IMEM, register file Reg[] (AddrA = inst[19:15], AddrB = inst[24:20], AddrD = inst[11:7]), Branch Comp., Imm. Gen on inst[31:7], ALU, DMEM; control signals PCSel, ImmSel, RegWEn, BrUn, BrEq, BrLT, BSel, ASel, ALUSel, MemRW, WBSel.]
CS 61c 5

Pipelining RISC-V RV32I Datapath
[Figure: the single-cycle datapath divided into five stages: Instruction Fetch (F), Instruction Decode/Register Read (D), ALU Execute (X), Memory Access (M), Write Back (W).]
CS 61c 6

Pipelined RISC-V RV32I Datapath
[Figure: datapath with pipeline registers between stages; the PC, instruction, rs1/rs2 values, immediate, and ALU result are carried forward stage by stage.]
Recalculate PC+4 in the M stage to avoid sending both PC and PC+4 down the pipeline. Must pipeline the instruction along with the data, so control operates correctly in each stage.
CS 61c 7

[Figure: pipelined datapath snapshot. Each stage operates on a different instruction: lw t0, 8(t3) in F; sw t0, 4(t3) in D; slt t6, t0, t3 in X; or t3, t4, t5 in M; add t0, t1, t2 in W.]
Pipeline registers separate the stages and hold the data for each instruction in flight.
CS 61c 8


Pipelined Control
- Control signals derived from the instruction, as in the single-cycle implementation
- Information is stored in pipeline registers for use by later stages
CS 61c 10

Hazards Ahead CS 61c Lecture 13: Pipelining 11


Structural Hazard
- Problem: two or more instructions in the pipeline compete for access to a single physical resource
- Solution 1: instructions take turns using the resource; some instructions have to stall
- Solution 2: add more hardware to the machine. A structural hazard can always be solved by adding more hardware
CS 61c Lecture 13: Pipelining 13

Regfile Structural Hazards
- Each instruction can read up to two operands in the decode stage and write one value in the writeback stage
- Avoid the structural hazard with separate ports: two independent read ports and one independent write port
- Three accesses per cycle can then happen simultaneously
CS 61c Lecture 13: Pipelining 14

Structural Hazard: Memory Access
instruction sequence: add t0, t1, t2; or t3, t4, t5; slt t6, t0, t3; sw t0, 4(t3); lw t0, 8(t3)
Instruction and data memory are used simultaneously. Solution: use two separate memories.
CS 61c Lecture 13: Pipelining 15

Instruction and Data Caches
[Figure: Processor (Control; Datapath with PC, Registers, ALU) connected to an Instruction Cache and a Data Cache, backed by Memory (DRAM) holding program bytes and data.]
Caches: small and fast buffer memories
CS 61c Lecture 13: Pipelining 16

Structural Hazards Summary
- A structural hazard is a conflict for use of a resource
- In a RISC-V pipeline with a single memory: load/store requires data access; without separate memories, instruction fetch would have to stall for that cycle, and all other operations in the pipeline would have to wait
- Pipelined datapaths therefore require separate instruction/data memories, or separate instruction/data caches
- RISC ISAs (including RISC-V) are designed to avoid structural hazards, e.g. at most one memory access per instruction
Lecture 13: Pipelining 17


Data Hazard: Register Access
instruction sequence: add t0, t1, t2; or t3, t4, t5; slt t6, t0, t3; sw t0, 4(t3); lw t0, 8(t3)
Separate ports, but what if a write and a read target the same register in the same cycle? Does sw in the example fetch the old or the new value of t0?
CS 61c Lecture 13: Pipelining 19

Register Access Policy
instruction sequence: add t0, t1, t2; or t3, t4, t5; slt t6, t0, t3; sw t0, 4(t3); lw t0, 8(t3)
Exploit the high speed of the register file (100 ps): 1) WB updates the value; 2) ID then reads the new value in the same cycle (indicated in diagrams by shading). This write-then-read trick might not always be possible, especially in high-frequency designs; check the assumptions in any question.
CS 61c Lecture 13: Pipelining 20

Data Hazard: ALU Result
Value of s0 over time: 5 5 5 5 5/9 9 9 9 9
instruction sequence: add s0, t0, t1; sub t2, s0, t0; or t6, s0, t3; xor t5, t1, s0; sw s0, 8(t3)
Without some fix, sub and or will calculate the wrong result!
CS 61c Lecture 13: Pipelining 21

Data Hazard: ALU Result
Value of s0: 5
instruction sequence: add s0, t1, t2; sub t2, s0, t5; or t6, s0, t3; xor t5, t1, s0; sw s0, 8(t3)
Without some fix, sub and or will calculate the wrong result!
CS 61c Lecture 13: Pipelining 22

Solution 1: Stalling
Problem: an instruction depends on the result of the previous instruction:
add s0, t0, t1
sub t2, s0, t3
Bubble: effectively a NOP; the affected pipeline stages do nothing

Stalls and Performance
- Stalls reduce performance, but are required to get correct results
- The compiler can arrange code to avoid hazards and stalls; this requires knowledge of the pipeline structure
CS 61c 24

Solution 2: Forwarding
Value of t0: 5 5 5 5 5/9 9 9 9 9
instruction sequence: add t0, t1, t2; or t3, t0, t5; sub t6, t0, t3; xor t5, t1, t0; sw t0, 8(t3)
Forwarding: grab the operand from a pipeline stage, rather than from the register file
CS 61c 25

Forwarding (aka Bypassing)
- Use a result as soon as it is computed; don't wait for it to be stored in a register
- Requires extra connections in the datapath
CS 61c Lecture 13: Pipelining 26

1) Detect Need for Forwarding
[Figure: instructions flowing through D, X, M, W; example sequence add t0, t1, t2; or t3, t0, t5; sub t6, t0, t3, with instX.rd compared against instD.rs1.]
Compare the destination registers of older instructions in the pipeline with the source registers of the new instruction in the decode stage. Must ignore writes to x0!
CS 61c 27
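The comparison described above can be sketched in a few lines. This is an illustrative model only; the function name and its tuple return are invented for the sketch and are not the actual forwarding-unit interface:

```python
# Toy forwarding-need check: compare an older instruction's destination
# register (still in X, M, or W) with the decode-stage instruction's sources.

def needs_forwarding(older_rd: int, older_writes: bool, rs1: int, rs2: int):
    """Return (forward_to_rs1, forward_to_rs2) for one older instruction."""
    if not older_writes or older_rd == 0:   # must ignore writes to x0!
        return (False, False)
    return (older_rd == rs1, older_rd == rs2)

# add t0, t1, t2 (t0 = x5) is in X; or t3, t0, t5 (rs1 = x5, rs2 = x30) is in D:
print(needs_forwarding(5, True, 5, 30))    # (True, False): forward to rs1 only
# An older instruction "writing" x0 must forward nothing:
print(needs_forwarding(0, True, 0, 0))     # (False, False)
```

In hardware this check runs in parallel for each older instruction and each source, and the younger match wins when several stages hold the same destination register.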

Forwarding Path
[Figure: the pipelined datapath with a forwarding path from the ALU result in M back to the ALU input in X, driven by the forwarding control logic.]
CS 61c 28

Administrivia
- Project 1 Part 2 due next Monday
- Project Party this Wednesday 7-9pm in Cory 293
- HW3 will be released by Friday
- Midterm 1 regrades due tonight
- Guerrilla Session tonight 7-9pm in Cory 293
CS 61c Lecture 13: Pipelining 29


Load Data Hazard
[Figure: pipeline diagram. The loaded value is not available until the end of the M stage, so an immediately dependent instruction incurs a 1-cycle stall that is unavoidable; forwarding then supplies the value, and later instructions are unaffected.]
CS 61c Lecture 13: Pipelining 31

Stall Pipeline
[Figure: the pipeline stalls by repeating the dependent instruction's stage for one cycle, then forwarding the loaded value.]
CS 61c Lecture 13: Pipelining 32

lw Data Hazard
- The slot after a load is called the load delay slot
- If the instruction in that slot uses the result of the load, the hardware will stall for one cycle
- Equivalent to inserting an explicit nop in the slot, except the latter uses more code space; either way, a performance loss
- Idea: put an unrelated instruction into the load delay slot. No performance loss!
CS 61c Lecture 13: Pipelining 33

Code Scheduling to Avoid Stalls
Reorder code to avoid using a load result in the next instruction! RISC-V code for D=A+B; E=A+C;

Original order (13 cycles):
lw   t1, 0(t0)
lw   t2, 4(t0)
add  t3, t1, t2    # stall!
sw   t3, 12(t0)
lw   t4, 8(t0)
add  t5, t1, t4    # stall!
sw   t5, 16(t0)

Alternative (11 cycles):
lw   t1, 0(t0)
lw   t2, 4(t0)
lw   t4, 8(t0)
add  t3, t1, t2
sw   t3, 12(t0)
add  t5, t1, t4
sw   t5, 16(t0)
CS 61c Lecture 13: Pipelining 34
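The two cycle counts above can be reproduced with a toy counter, assuming a 5-stage pipeline with full forwarding where only a load immediately followed by a user of its result costs one bubble. The instruction-tuple format here is invented for the sketch:

```python
# Toy cycle counter: fill/drain cost of an N-stage pipeline plus one stall
# cycle per load-use pair in adjacent slots. Tuples are (opcode, dest, sources).

def cycles(program, stages=5):
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op, dest, _ = prev
        if op == "lw" and dest in cur[2]:   # load result used in the next slot
            stalls += 1
    return len(program) + (stages - 1) + stalls

original = [
    ("lw",  "t1", ()),           ("lw",  "t2", ()),
    ("add", "t3", ("t1", "t2")), ("sw",  None, ("t3", "t0")),
    ("lw",  "t4", ()),           ("add", "t5", ("t1", "t4")),
    ("sw",  None, ("t5", "t0")),
]
scheduled = [
    ("lw",  "t1", ()),           ("lw",  "t2", ()),
    ("lw",  "t4", ()),           ("add", "t3", ("t1", "t2")),
    ("sw",  None, ("t3", "t0")), ("add", "t5", ("t1", "t4")),
    ("sw",  None, ("t5", "t0")),
]
print(cycles(original))    # 13: 7 instructions + 4 fill/drain + 2 stalls
print(cycles(scheduled))   # 11: same work, zero stalls
```

Note that the sw instructions after each add cause no stall even though they use the add result in the very next slot: forwarding covers ALU results, and only loads leave a one-cycle gap.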


Control Hazards
beq t0, t1, label
sub t2, s0, t5    <- executed regardless of branch outcome!
or  t6, s0, t3    <- executed regardless of branch outcome!!!
xor t5, t1, s0    <- PC updated reflecting branch outcome
sw  s0, 8(t3)
CS 61c Lecture 13: Pipelining 36

Observation
- If the branch is not taken, the instructions fetched sequentially after the branch are correct
- If the branch or jump is taken, the incorrect instructions must be flushed from the pipeline by converting them to NOPs
CS 61c Lecture 13: Pipelining 37

Kill Instructions after Branch if Taken
beq t0, t1, label    <- taken branch
sub t2, s0, t5       <- convert to NOP
or  t6, s0, t3       <- convert to NOP
label: xxxxxx        <- PC updated reflecting branch outcome
CS 61c Lecture 13: Pipelining 38

Reducing Branch Penalties
- Every taken branch in the simple pipeline costs 2 dead cycles
- To improve performance, use branch prediction to guess which way the branch will go earlier in the pipeline
- Only flush the pipeline if the branch prediction was incorrect
CS 61c Lecture 13: Pipelining 39
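The payoff from prediction can be quantified with the slide's 2-cycle penalty. This is a hedged sketch: the 20% branch frequency, 60% taken rate, and 90% predictor accuracy are illustrative assumptions, not numbers from the lecture:

```python
# Toy CPI model: base CPI of 1 plus 'penalty' dead cycles for every branch
# that forces a pipeline flush.

def cpi(branch_freq: float, penalty: int, flush_rate: float) -> float:
    """flush_rate = fraction of branches that flush the pipeline."""
    return 1 + branch_freq * flush_rate * penalty

# Without prediction, every taken branch flushes (assume 60% taken):
print(cpi(branch_freq=0.20, penalty=2, flush_rate=0.60))  # ~1.24
# With a 90%-accurate predictor, only mispredictions flush:
print(cpi(branch_freq=0.20, penalty=2, flush_rate=0.10))  # ~1.04
```

Under these assumed numbers, prediction recovers most of the branch penalty, cutting the CPI overhead from 0.24 to 0.04 cycles per instruction.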

Branch Prediction
[Figure: beq t0, t1, label with a taken branch. Guess the next PC at fetch, then check whether the guess was correct when the branch resolves.]
CS 61c Lecture 13: Pipelining 40


Increasing Processor Performance
1. Clock rate: limited by technology and power dissipation
2. Pipelining: overlap instruction execution. Deeper pipelines (5 => 10 => 15 stages) mean less work per stage, so a shorter clock cycle, but more potential for hazards (CPI > 1)
3. Multi-issue superscalar processor: multiple execution units (ALUs), several instructions executed simultaneously, CPI < 1 (ideally)
CS 61c Lecture 13: Pipelining 42

Superscalar Processor P&H p. 340 CS 61c Lecture 13: Pipelining 43

Benchmark: CPI of Intel Core i7
[Figure: measured CPI across benchmarks, with the CPI = 1 line shown for reference; P&H p. 350.]
CS 61c Lecture 13: Pipelining 44

In Conclusion
- Pipelining increases throughput by overlapping execution of multiple instructions
- All pipeline stages have the same duration; choose a partition of the datapath that accommodates this constraint
- Hazards potentially limit performance; maximizing performance requires programmer/compiler assistance, e.g. load and branch delay slots
- Superscalar processors use multiple execution units for additional instruction-level parallelism; the performance benefit is highly code dependent
CS 61c Lecture 13: Pipelining 45

Extra Slides CS 61c Lecture 13: Pipelining 46

Pipelining and ISA Design
The RISC-V ISA was designed for pipelining:
- All instructions are 32 bits: easy to fetch and decode in one cycle (versus x86's 1- to 15-byte instructions)
- Few and regular instruction formats: decode and read registers in one step
- Load/store addressing: calculate the address in the 3rd stage, access memory in the 4th stage
- Alignment of memory operands: a memory access takes only one cycle
CS 61c Lecture 13: Pipelining 47

Superscalar Processor
- Multiple-issue superscalar: replicate pipeline stages => multiple pipelines; start multiple instructions per clock cycle
- CPI < 1, so use instructions per cycle (IPC) instead
- E.g., a 4 GHz 4-way multiple-issue processor: 16 BIPS, peak CPI = 0.25, peak IPC = 4. Dependencies reduce this in practice
- Out-of-order execution: reorder instructions dynamically in hardware to reduce the impact of hazards
- CS152 discusses these techniques!
CS 61c Lecture 13: Pipelining 48
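The slide's peak-throughput arithmetic can be written out directly. The sustained IPC of 2.5 at the end is an illustrative assumption, not a figure from the lecture:

```python
# Peak throughput of a multi-issue processor: clock rate times issue width.

def peak_bips(clock_ghz: float, issue_width: int) -> float:
    """Billions of instructions per second at peak (one bundle per cycle)."""
    return clock_ghz * issue_width

clock_ghz, width = 4.0, 4
print(peak_bips(clock_ghz, width))   # 16.0 BIPS
print(1 / width)                     # peak CPI = 0.25 (IPC = 4)
# Dependencies reduce this in practice; e.g. a sustained IPC of 2.5 gives:
print(clock_ghz * 2.5)               # 10.0 BIPS sustained
```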