ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 6 Pipelining Part 1

Similar documents
Lecture 4 - Pipelining

Lecture 7 Pipelining. Peng Liu.

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 9 Instruction-Level Parallelism Part 2

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 8 Instruction-Level Parallelism Part 1

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 15 Very Long Instruction Word Machines

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines

Computer Architecture ELEC3441

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 4 Reduced Instruction Set Computers

CS 152 Computer Architecture and Engineering. Lecture 13 - VLIW Machines and Statically Scheduled ILP

CS 152, Spring 2011 Section 2

Simple Instruction Pipelining

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 13 Memory Part 2

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 13 Memory Part 2

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science

COSC 6385 Computer Architecture - Pipelining

CISC 662 Graduate Computer Architecture Lecture 6 - Hazards

ECE/CS 552: Pipeline Hazards

CS 152 Computer Architecture and Engineering. Lecture 16 - VLIW Machines and Statically Scheduled ILP

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

CS 152 Computer Architecture and Engineering. Lecture 16: Vector Computers

ECE154A Introduction to Computer Architecture. Homework 4 solution

Computer Architecture

Very Simple MIPS Implementation

LECTURE 3: THE PROCESSOR

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 3 Early Microarchitectures

CS 152 Computer Architecture and Engineering. Lecture 13 - Out-of-Order Issue and Register Renaming

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming

Lecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1

Pipelining. Maurizio Palesi

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

Hakim Weatherspoon CS 3410 Computer Science Cornell University

(Basic) Processor Pipeline

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 2 CISC and Microcoding

Lecture Topics. Announcements. Today: Data and Control Hazards (P&H ) Next: continued. Exam #1 returned. Milestone #5 (due 2/27)

Pipelining. Pipeline performance

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Full Datapath. Chapter 4 The Processor 2

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Appendix C. Abdullah Muzahid CS 5513

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

Processor (II) - pipelining. Hwansoo Han

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!

C 1. Last time. CSE 490/590 Computer Architecture. Complex Pipelining I. Complex Pipelining: Motivation. Floating-Point Unit (FPU) Floating-Point ISA

Pipelining. CSC Friday, November 6, 2015

Modern Computer Architecture

EITF20: Computer Architecture Part2.2.1: Pipeline-1

CSEE 3827: Fundamentals of Computer Systems

Chapter 4 The Processor 1. Chapter 4B. The Processor

Very Simple MIPS Implementation

CS 152, Spring 2011 Section 8

CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. Complications With Long Instructions

ECE232: Hardware Organization and Design

Advanced Computer Architecture Pipelining

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

ECE331: Hardware Organization and Design

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 19 Summary

Chapter 4 The Processor 1. Chapter 4A. The Processor

ECE473 Computer Architecture and Organization. Pipeline: Data Hazards

RISC Pipeline. Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University. See: P&H Chapter 4.6

Basic Pipelining Concepts

Computer Systems Architecture Spring 2016

ECE260: Fundamentals of Computer Engineering

The Processor (3) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

MIPS An ISA for Pipelining

ECS 154B Computer Architecture II Spring 2009

Lecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University

Slide Set 7. for ENCM 501 in Winter Term, Steve Norman, PhD, PEng

PIPELINING: HAZARDS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

ECE260: Fundamentals of Computer Engineering

COMPUTER ORGANIZATION AND DESIGN

Page 1. Pipelining: Its Natural! Chapter 3. Pipelining. Pipelined Laundry Start work ASAP. Sequential Laundry A B C D. 6 PM Midnight

1 Hazards COMP2611 Fall 2015 Pipelined Processor

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

VLIW/EPIC: Statically Scheduled ILP

Chapter 4. The Processor

CS 152 Computer Architecture and Engineering. Lecture 5 - Pipelining II (Branches, Exceptions)

Full Datapath. Chapter 4 The Processor 2

Lecture 4 Pipelining Part II

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

The Processor Pipeline. Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes.

Thomas Polzer Institut für Technische Informatik

C 1. Last Time. CSE 490/590 Computer Architecture. ISAs and MIPS. Instruction Set Architecture (ISA) ISA to Microarchitecture Mapping

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

6.823 Computer System Architecture Datapath for DLX Problem Set #2

CS 61C: Great Ideas in Computer Architecture Pipelining and Hazards

CSE 533: Advanced Computer Architectures. Pipelining. Instructor: Gürhan Küçük. Yeditepe University

CSCI-564 Advanced Computer Architecture

COSC4201 Pipelining. Prof. Mokhtar Aboelaze York University

Midnight Laundry. IC220 Set #19: Laundry, Co-dependency, and other Hazards of Modern (Architecture) Life. Return to Chapter 4

Lecture 05: Pipelining: Basic/ Intermediate Concepts and Implementation

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Pipeline Control Hazards and Instruction Variations

Complex Pipelines and Branch Prediction

COMPUTER ORGANIZATION AND DESIGN

Transcription:

ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 6 Pipelining Part 1 Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall12.html

ECE552 Administrivia 27 September Homework #2 Due Assignment on web page. Teams of 2-3. Submit soft copies to Sakai. Use Piazza for questions. 2 October Class Discussion Roughly one reading per class. Do not wait until the day before! 1. Srinivasan et al. Optimizing pipelines for power and performance 2. Mahlke et al. A comparison of full and partial predicated execution support for ILP processors 3. Palacharla et al. Complexity-effective superscalar processors 4. Yeh et al. Two-level adaptive training branch prediction ECE 552 / CPS 550 2

Pipelining Latency = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle) Performance Enhancement - Increases number of cycles per instruction - Reduces number of seconds per cycle Instruction-Level Parallelism - Begin with multi-cycle design - When one instruction advances from stage-1 to stage=2, allow next instruction to enter stage-1. - Individual instructions require the same number of stages - Multiple instructions in-flight, entering and leaving at faster rate Multi-cycle insn0.fetch insn0.dec insn0.exec insn1.fetch insn1.dec insn1.exec Pipelined insn0.fetch insn0.dec insn0.exec insn1.fetch insn1.dec insn1.exec ECE 552 / CPS 550 3

Ideal Pipelining stage 1 stage 2 stage 3 stage 4 - All objects go through the same stages - No resources shared between any two stages - Equal propagation delay through all pipeline stages - An object entering the pipeline is not affected by objects in other stages - These conditions generally hold for industrial assembly lines - But can an instruction pipeline satisfy the last condition? Technology Assumptions - Small, very fast memory (caches) backed by large, slower memory - Multi-ported register file, which is slower than a single-ported one - Consider 5-stage pipelined Harvard architecture ECE 552 / CPS 550 4

Practical Pipelining stage 1 stage 2 stage 3 stage 4 Pipeline Overheads - Each stage requires registers, which hold state/data communicated from one stage to next, incurring hardware and delay overheads - Each stage requires partitioning logic into equal lengths - Introduces diminishing marginal returns from deeper pipelines Pipeline Hazards - Instructions do not execute independently - Instructions entering the pipeline depend on in-flight instructions or contend for shared hardware resources ECE 552 / CPS 550 5

Pipelining MIPS First, build MIPS without pipelining - Single-cycle MIPS datapath Then, pipeline into multiple stages - Multi-cycle MIPS datapath - Add pipeline registers to separate logic into stages - MIPS partitions into 5 stages - 1: Instruction Fetch (IF) - 2: Instruction Decode (ID) - 3: Execute (EX) - 4: Memory (MEM ) - 5: Write Back (WB) ECE 552 / CPS 550 6

5-Stage Pipelined Datapath (MIPS) IF/ID ID/EX EX/MEM MEM/WB IF: mem[pc]; PC PC + 4; ID: A Reg[ rs ]; B Reg[ rt ]; ECE 552 / CPS 550 7

5-Stage Pipelined Datapath (MIPS) IF/ID ID/EX EX/MEM MEM/WB EX: Result A op op B; MEM: WB Result; WB: Reg[ rd ] WB ECE 552 / CPS 550 8

Visualizing the Pipeline ECE 552 / CPS 550 9

Hazards and Limits to Pipelining Hazards prevent next instruction from executing during its designated clock cycle Structural Hazards - Hardware cannot support this combination of instructions. - Example: Limited resources required by multiple instructions (e.g. FPU) Data Hazards - Instruction depends on result of prior instruction still in pipeline - Example: An integer operation is waiting for value loaded from memory Control Hazards - Instruction fetch depends on decision about control flow - Example: Branches and jumps change PC ECE 552 / CPS 550 10

Structural Hazards A single memory port causes structural hazard during data load, instr fetch ECE 552 / CPS 550 11

Structural Hazards Stall the pipeline, creating bubbles, by freezing earlier stages interlocks Use Harvard Architecture (separate instruction, data memories) ECE 552 / CPS 550 12

Data Hazards Instruction depends on result of prior instruction still in pipeline ECE 552 / CPS 550 13

Data Hazards Read After Write (RAW) - Caused by a dependence, need for communication - Instr-j tries to read operand before Instr-I writes it i: add r1, r2, r3 j: sub r4, r1, 43 Write After Read (WAR) - Caused by an anti-dependence and the re-use of the name r1 - Instr-j tries to write operand (r1) before Instr-I reads it i: add r4, r1, r3 j: add r1, r2, r3 k: mul r6, r1, r7 Write After Write (WAW) - Caused by an output dependence and the re-use of the name r1 - Instr-j tries to write operand (r1) before Instr-I writes it i: sub r1, r4, r3 j: add r1, r2, r3 k: mul r6, r1, r7 ECE 552 / CPS 550 14

Resolving Data Hazards FB 1 FB 2 FB 3 FB 4 stage 1 stage 2 stage 3 stage 4 Strategy 1 Interlocks and Pipeline Stalls - Later stages provide dependence information to earlier stages, which can stall or kill instructions - Works as long as instruction at stage i+1 can complete without any interference from instructions in stages 1 through i (otherwise, deadlocks may occur) ECE 552 / CPS 550 15

Interlocks & Pipeline Stalls time t0 t1 t2 t3 t4 t5 t6 t7.... (I 1 ) r1 (r0) + 10 IF 1 ID 1 EX 1 MA 1 WB 1 (I 2 ) r4 (r1) + 17 IF 2 ID 2 ID 2 ID 2 ID 2 EX 2 MA 2 WB 2 (I 3 ) IF 3 IF 3 IF 3 IF 3 ID 3 EX 3 MA 3 WB 3 (I 4 ) (I 5 ) stalled stages IF 4 ID 4 IF 5 EX 4 ID 5 MA 4 WB 4 EX 5 MA 5 WB 5 Resource Usage time t0 t1 t2 t3 t4 t5 t6 t7.... IF I 1 I 2 I 3 I 3 I 3 I 3 I 4 I 5 ID I 1 I 2 I 2 I 2 I 2 I 3 I 4 I 5 EX I 1 nop nop nop I 2 I 3 I 4 I 5 MA I 1 nop nop nop I 2 I 3 I 4 I 5 WB I 1 nop nop nop I 2 I 3 I 4 I 5 ECE 552 / CPS 550 16

Interlocks & Pipeline Stalls Stall Condition Example Dependence r1 r0 + 10 r4 r1 + 17 0x4 Add nop 31 PC addr inst Inst Memory we rs1 rs2 rd1 ws wd rd2 GPRs Imm Ext A B ALU Y we addr rdata Data Memory wdata wdata R MD1 MD2 ECE 552 / CPS 550 17

Interlock Control Logic - Compare the source registers of instruction in decode stage with the destination registers of uncommitted instructions - Stall if a source register in decode matches some destination register? - No, not every instruction writes to a register - No, not every instruction reads from a register - Derive stall signal from conditions in the pipeline ECE 552 / CPS 550 18

Interlock Control Logic stall C stall ws rs rt? 0x4 Add nop 31 PC addr inst Inst Memory we rs1 rs2 rd1 ws wd rd2 GPRs Imm Ext A B ALU Y we addr rdata Data Memory wdata wdata R MD1 MD2 Compare the source registers of the instruction in the decode stage (rs, rt) with the destination register of the uncommitted instructions (ws). ECE 552 / CPS 550 19

Interlock Control Logic stall ws we C stall rs rt? re1 re2 we C dest ws we ws C dest 0x4 Add C re nop 31 PC addr inst Inst Memory we rs1 rs2 rd1 ws wd rd2 GPRs Imm Ext A B ALU Y we addr rdata Data Memory wdata wdata C dest R MD1 MD2 Should we always stall if RS/RT matches some WS? No, because not every instruction writes/reads a register. Introduce write/read enable signals (we/re) ECE 552 / CPS 550 20

Source and Destination Registers R-type: op rs rt rd func I-type: op rs rt immediate16 J-type: op immediate26 instruction source(s) destination ALU rd (rs) func (rt) rs, rt rd ALUi rt (rs) op imm rs rt LW rt M[(rs) + imm] rs rt SW M [(rs) + imm] (rt) rs, rt BZ cond (rs) true: PC (PC) + imm rs false: PC (PC) + 4 rs J PC (PC) + imm JAL r31 (PC), PC (PC) + imm R31 JR PC (rs) rs JALR r31 (PC), PC (rs) rs R31 ECE 552 / CPS 550 21

Interlock Control Logic stall ws we C stall rs rt? re1 re2 we C dest ws we ws C dest 0x4 Add C re nop 31 PC addr inst Inst Memory we rs1 rs2 rd1 ws wd rd2 GPRs Imm Ext A B ALU Y we addr rdata Data Memory wdata wdata C dest R MD1 MD2 Should we always stall if RS/RT matches some RD? No, because not every instruction writes/reads a register. Introduce write/read enable signals (we/re) ECE 552 / CPS 550 22

Deriving the Stall Signal Cdest ws Case(opcode) ALU: ALUi: JAL, JALR: ws rd ws rt ws R31 we Case(opcode) ALU, ALUi, LW we (ws!= 0) JAL, JALR we 1 otherwise we 0 Cre re1 Case(opcode) ALU, ALUi re1 1 LW, SW, BZ re1 1 JR, JALR re1 1 J, JAL re1 0 re2 Case(opcode) << same as re1 but for register rt>> ECE 552 / CPS 550 23

Deriving the Stall Signal Notation: [pipeline-stage][signal] E.g., Drs rs signal from decode stage E.g., Ewe we signal from execute stage Cstall stall-1 ( (Drs == Ews) & Ewe (Drs == Mws) & Mwe (Drs == Wws) & Wwe ) & Dre1 stall-2 ( (Drt == Ews) & Ewe (Drt == Mws) & Mwe (Drt == Wws) & Wwe ) & Dre2 stall stall-1 stall-2 ECE 552 / CPS 550 24

Load/Store Data Hazards M[(r1)+7] (r2) r4 M[(r3)+5] What is the problem here? What if (r1)+7 == (r3)+5? Load/Store hazards may be resolved in the pipeline or may be resolved in the memory system. More later. ECE 552 / CPS 550 25

Resolving Data Hazards Strategy 2 Forwarding (aka Bypasses) - Route data as soon as possible to earlier stages in the pipeline - Example: forward ALU output to its input t0 t1 t2 t3 t4 t5 t6 t7.... (I 1 ) r1 r0 + 10 IF 1 ID 1 EX 1 MA 1 WB 1 (I 2 ) r4 r1 + 17 IF 2 ID 2 ID 2 ID 2 ID 2 EX 2 MA 2 WB 2 (I 3 ) IF 3 IF 3 IF 3 IF 3 ID 3 EX 3 MA 3 (I 4 ) stalled stages IF 4 ID 4 EX 4 (I 5 ) IF 5 ID 5 time t0 t1 t2 t3 t4 t5 t6 t7.... (I 1 ) r1 r0 + 10 IF 1 ID 1 EX 1 MA 1 WB 1 (I 2 ) r4 r1 + 17 IF 2 ID 2 EX 2 MA 2 WB 2 (I 3 ) IF 3 ID 3 EX 3 MA 3 WB 3 (I 4 ) IF 4 ID 4 EX 4 MA 4 WB 4 (I 5 ) IF 5 ID 5 EX 5 MA 5 WB 5 ECE 552 / CPS 550 26

Example Forwarding Path stall 0x4 Add nop E M W 31 ASrc PC addr inst Inst Memory D we rs1 rs2 rd1 ws wd rd2 GPRs Imm Ext A B ALU Y we addr rdata Data Memory wdata R MD1 MD2 ECE 552 / CPS 550 27

Deriving Forwarding Signals This forwarding path only applies to the ALU operations Eforward Case(Eopcode) ALU, ALUi Eforward (ws!= 0) otherwise Eforward 0 and all other operations will need to stall as before Estall Case(Eopcode) LW Estall (ws!= 0) JAL, JALR Estall 1 otherwise Estall 0 Asrc (Drs == Ews) & Dre1 & Eforward Remember to update stall signal, removing case covered by this forwarding path ECE 552 / CPS 550 28

Multiple Forwarding Paths ECE 552 / CPS 550 29

Multiple Forwarding Paths stall PC for JAL,... 0x4 Add ASrc nop E M W 31 PC addr inst Inst Memory D we rs1 rs2 rd1 ws wd rd2 GPRs Imm Ext BSrc A B MD1 ALU Y MD2 we addr rdata Data Memory wdata R ECE 552 / CPS 550 30

Forwarding Hardware ECE 552 / CPS 550 31

Forwarding Loads/Stores ECE 552 / CPS 550 32

Data Hazard Despite Forwarding LD cannot forward (backwards in time) to DSUB. What is the solution? ECE 552 / CPS 550 33

Data Hazards and Scheduling Try producing faster code for - A = B + C; D = E F; - Assume A, B, C, D, E, and F are in memory - Assume pipelined processor Slow Code LW Rb, b LW Rc, c ADD Ra, Rb, Rc SW a, Ra LW Re e LW Rf, f SUB Rd, Re, Rf SW d, RD Fast Code LW Rb, b LW Rc, c LW Re, e ADD Ra, Rb, Rc LW Rf, f SW a, Ra SUB Rd, Re, Rf SW d, RD ECE 552 / CPS 550 34

Acknowledgements These slides contain material developed and copyright by - Arvind (MIT) - Krste Asanovic (MIT/UCB) - Joel Emer (Intel/MIT) - James Hoe (CMU) - John Kubiatowicz (UCB) - Alvin Lebeck (Duke) - David Patterson (UCB) - Daniel Sorin (Duke) ECE 552 / CPS 550 35