(Basic) Processor Pipeline

Similar documents
Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

ECE/CS 552: Pipeline Hazards

COSC 6385 Computer Architecture - Pipelining

Chapter 4. The Processor

Pipeline design. Mehran Rezaei

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

LECTURE 3: THE PROCESSOR

COMPUTER ORGANIZATION AND DESIGN

Chapter 4 The Processor 1. Chapter 4A. The Processor

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

The Processor (3) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Full Datapath. Chapter 4 The Processor 2

Computer Architecture

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Full Datapath. Chapter 4 The Processor 2

Processor (II) - pipelining. Hwansoo Han

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

CISC 662 Graduate Computer Architecture Lecture 6 - Hazards

Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science

Lecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University

Chapter 4. The Processor

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

Thomas Polzer Institut für Technische Informatik

Modern Computer Architecture

COMPUTER ORGANIZATION AND DESIGN

Pipelining. CSC Friday, November 6, 2015

PIPELINING: HAZARDS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

Hakim Weatherspoon CS 3410 Computer Science Cornell University

CSE 533: Advanced Computer Architectures. Pipelining. Instructor: Gürhan Küçük. Yeditepe University

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Lecture Topics. Announcements. Today: Data and Control Hazards (P&H ) Next: continued. Exam #1 returned. Milestone #5 (due 2/27)

Chapter 4 The Processor 1. Chapter 4B. The Processor

1 Hazards COMP2611 Fall 2015 Pipelined Processor

Instruction Pipelining Review

Lecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1

Computer Architecture Spring 2016

COMPUTER ORGANIZATION AND DESI

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version

Lecture 7 Pipelining. Peng Liu.

EITF20: Computer Architecture Part2.2.1: Pipeline-1

RISC Pipeline. Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University. See: P&H Chapter 4.6

EECS 470 Lecture 4. Pipelining & Hazards II. Fall 2018 Jon Beaumont

Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Chapter 4. The Processor

CSEE 3827: Fundamentals of Computer Systems

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Chapter 4. The Processor

ELE 655 Microprocessor System Design

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EIE/ENE 334 Microprocessors

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture

ECE260: Fundamentals of Computer Engineering

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!

Chapter 3. Pipelining. EE511 In-Cheol Park, KAIST

14:332:331 Pipelined Datapath

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 6 Pipelining Part 1

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

COSC4201 Pipelining. Prof. Mokhtar Aboelaze York University

Lecture 2: Processor and Pipelining 1

Pipelining: Hazards Ver. Jan 14, 2014

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

The Pipelined RiSC-16

ECE154A Introduction to Computer Architecture. Homework 4 solution

ECE 505 Computer Architecture

Instruction word R0 R1 R2 R3 R4 R5 R6 R8 R12 R31

Pipelined Processors. Ideal Pipelining. Example: FP Multiplier. 55:132/22C:160 Spring Jon Kuhl 1

CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin. School of Information Science and Technology SIST

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

CPE300: Digital System Architecture and Design

Midnight Laundry. IC220 Set #19: Laundry, Co-dependency, and other Hazards of Modern (Architecture) Life. Return to Chapter 4

Outline Marquette University

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16

COMPUTER ORGANIZATION AND DESIGN

Very Simple MIPS Implementation

zhandling Data Hazards The objectives of this module are to discuss how data hazards are handled in general and also in the MIPS architecture.

Outline. A pipelined datapath Pipelined control Data hazards and forwarding Data hazards and stalls Branch (control) hazards Exception

CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. Complications With Long Instructions

ECE473 Computer Architecture and Organization. Pipeline: Data Hazards

Pipelining & Hazards. Prof. Thomas Wenisch GAS STATION. Lecture 3 EECS 470. Slide 1

Very Simple MIPS Implementation

MIPS An ISA for Pipelining

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

CS 61C: Great Ideas in Computer Architecture Pipelining and Hazards

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition. Chapter 4. The Processor

Computer Organization and Structure

Appendix C: Pipelining: Basic and Intermediate Concepts

CS 61C: Great Ideas in Computer Architecture. Lecture 13: Pipelining. Krste Asanović & Randy Katz

EITF20: Computer Architecture Part2.2.1: Pipeline-1

CENG 3420 Lecture 06: Pipeline

EECS 470. Further review: Pipeline Hazards and More. Lecture 2 Winter 2018

Pipelined datapath Staging data. CS2504, Spring'2007 Dimitris Nikolopoulos

Advanced Computer Architecture

Outline. Pipelining basics The Basic Pipeline for DLX & MIPS Pipeline hazards. Handling exceptions Multi-cycle operations

Transcription:

(Basic) Processor Pipeline Nima Honarmand

Generic Instruction Life Cycle Logical steps in processing an instruction: Instruction Fetch (IF_STEP) Instruction Decode (ID_STEP) Operand Fetch (OF_STEP) Might be from registers or memory Execute (EX_STEP) Perform computation on the operands Result Store or Write Back (RS_STEP) Write the execution results back to registers or memory ISA determines what needs to be done in each step for each instruction Micro-architecture determines how HW implements steps

Datapath vs. Control Logic Datapath is the collection of HW components and their connection in a processor Determines the static structure of processor E.g., inst/data caches, register file, ALU(s), lots of multiplexers, etc. Control logic determines the dynamic flow of data between the components, e.g., the control lines of MUXes and ALU read/write controls of caches and register files enable/disable controls of flip-flops Micro-architecture = Datapath + control logic

Example: MIPS Instruction Set In MIPS, all instructions are 32 bits ALU Mem Control Flow

Building a Simple MIPS Datapath (1) +4 PC I-cache Reg File ALU ALU

Building a Simple MIPS Datapath (2) +4 PC I-cache Reg File ALU D-cache Mem

Building a Simple MIPS Datapath (3) +4 + PC I-cache Reg File ALU D-cache Control Flow

Building a Simple MIPS Datapath (4) +4 + PC I-cache Reg File ALU D-cache Control Flow

Our Final MIPS Datapath Write-Back (WB) +4 + PC I-cache Reg File ALU D-cache Inst. Fetch (IF) Inst. Decode & Register Read (ID) Execute (EX) Memory (MEM) IF_STEP ID_STEP OF_STEP EX_STEP RS_STEP Datapath steps need not directly map to logical steps!

What about the Control Logic? Datapath is only half the micro-architecture Control logic is the other half There are different possibilities for implementing the control logic of our simple MIPS datapath, including Single cycle operation Multi-cycle operation Pipelined operation

Single Cycle Operation Single-cycle ins0.(fetch,dec,ex,mem,wb) ins1.(fetch,dec,ex,mem,wb) Only one instruction is using the datapath at any time Single-cycle control: all components operate in one, very long, clock cycle At the rising edge of clock, PC gets the new address (new inst); it is the address to I$ After some delay, I$ outputs the required word (assuming a hit) After some delay, is decoded and parts of becomes read addresses to register file After some delay, register file outputs the values of the registers After some delay, ALU generates its output and branch-adder generates next inst address; ALU output is the input to D$ (if memory instruction) After some delay, D$ finished its operations (load or store); if load, it generates the output Next inst s cycle: at the rising edge of clock, outputs of ALU or D$ is latched in the register file, and the next-inst address is latched in PC This has good IPC (= 1) but very slow clock

Multi-Cycle Operation (1) Multi-cycle ins0.fetch ins0.(dec,ex) ins0.(mem,wb) ins1.fetch ins1.(dec,ex) ins1.(mem,wb) Again, Only one instruction is using datapath at any time Perform each subset of the previous steps in a different clock cycle First cycle: At the rising edge of clock, PC gets new value, activates I$; I$ generates the instruction word (assuming a hit) Second cycle: At the rising edge of clock, inst word is latched into a temporary register which becomes input to control logic and register file output of register file is fed to ALU ALU generates its output Branch-adder generates its output

Multi-Cycle Operation (2) Multi-cycle ins0.fetch ins0.(dec,ex) ins0.(mem,wb) ins1.fetch ins1.(dec,ex) ins1.(mem,wb) Third cycle: At the rising edge of clock, ALU output is latched into a temporary register and becomes input to D$ D$ performs the operation (assuming a hit) Next instruction s first cycle: ALU or D$ output is stored in register file Next-inst address is latched into PC This has bad IPC (= 0.33) but faster clock Can we have both low IPC and short clock period? Yes, through pipelining

Pipelined Operation Multi-cycle ins0.fetch ins0.(dec,ex) ins0.(mem,wb) ins1.fetch ins1.(dec,ex) ins1.(mem,wb) Pipelined ins0.fetch ins0.(dec,ex) ins0.(mem,wb) time ins1.fetch ins1.(dec,ex) ins2.fetch ins1.(mem,wb) ins2.(dec,ex) ins2.(mem,wb) Start with multi-cycle design When insn0 goes from stage 1 to stage 2, insn1 starts stage 1 Doable as long as different stages use distinct resources This is the case in our datapath Each instruction passes through all stages, but instructions enter and leave at faster rate Style Ideal IPC Cycle Time (1/freq) Single-cycle 1 Long Multi-cycle < 1 Short Pipelined 1 Short Pipeline can have as many insns in flight as there are stages

5-Stage MIPS Pipelined Datapath

Stage 1: Fetch Fetch an instruction from instruction cache every cycle Use PC to index instruction cache Increment PC (assume no branches for now) Write state to the pipeline register IF/ID The next stage will read this pipeline register

Instruction bits Decode PC + 4 Stage 1: Fetch Diagram M U X 4 + target en PC Instruction Cache en IF / ID Pipeline register

Stage 2: Decode Decodes opcode bits Set up Control signals for later stages Read input operands from register file Specified by decoded instruction bits Write state to the pipeline register ID/EX Opcode Register contents, immediate operand PC+4 (even though decode didn t use it) Control signals (from insn) for opcode and destreg

Instruction bits Control Signals/imm regb contents Fetch Execute PC + 4 rega contents PC + 4 Stage 2: Decode Diagram target rega regb destreg data en Register File IF / ID Pipeline register ID / EX Pipeline register

Stage 3: Execute Perform ALU operations Calculate result of instruction Control signals select operation Contents of rega used as one input Either regb or constant offset (imm from insn) used as second input Calculate PC-relative branch target PC+4+(constant offset) Write state to the pipeline register EX/Mem ALU result, contents of regb, and PC+4+offset Control signals (from insn) for opcode and destreg

Control Signals/imm Control Signals regb contents regb contents Decode Memory rega contents ALU result PC + 4 PC+4 +offset Stage 3: Execute Diagram target + M U X A L U destreg data ID / EX Pipeline register EX/Mem Pipeline register

Stage 4: Memory Perform data cache access ALU result contains address for LD or ST Opcode bits control R/W and enable signals Write state to the pipeline register Mem/WB ALU result and Loaded data Control signals (from insn) for opcode and destreg

Control signals Control signals regb contents Execute Loaded data ALU result Write-back ALU result PC+4 +offset Stage 4: Memory Diagram target in_addr in_data Data Cache en R/W destreg data EX/Mem Pipeline register Mem/WB Pipeline register

Stage 5: Write-back Writing result to register file (if required) Write Loaded data to destreg for LD Write ALU result to destreg for ALU insn Opcode bits control register write enable signal

Control signals Memory Loaded data ALU result Stage 5: Write-back Diagram data M U X Mem/WB Pipeline register destreg M U X

Control signals/imm Control signals Control signals Putting It All Together M U X 4 + PC+4 PC+4 + target PC Inst Cache instruction rega regb data dest Register File vala valb M U X A L U eq? ALU result valb Data Cache ALU result mdata M U X data dest M U X IF/ID ID/EX EX/Mem Mem/WB

Pipelining Issues

Pipeline Hazards A pipeline hazard is any condition that disrupts the normal flow of instructions in the pipeline Three types of pipeline hazards 1) Structural hazards: required resource is busy 2) Data hazards: need to wait for previous instruction to complete its data read/write 3) Control hazards: deciding on control flow depends on previous instruction

Structural Hazard (1) Conflict for use of a resource When multiple instructions need the same resource at the same time E.g., in MIPS pipeline with a single cache Load/store requires data access Instruction fetch would have to stall for that cycle Hence, pipelined datapaths require separate instruction/data caches to avoid this structural hazard

Structural Hazard (2) Another example: if the register file could only do either read or write (but not both) in one cycle ID and WB stages would conflict Solution: allow reads and writes in same cycle E.g., perform the write at rising edge of the clock and the read at the falling edge Why not the other way around? Because, in our MIPS pipeline, reads come from younger instructions and writes older inst. If they both access the same register, younger inst. should read the result of the older inst.

Instruction Dependencies (1) Instruction dependencies are root causes of data and control hazards 1) Data Dependence Read-After-Write (RAW) (the only true dependence) Read must wait until earlier write finishes Anti-Dependence (WAR) Write must wait until earlier read finishes (avoid clobbering) Output Dependence (WAW) Earlier write can t overwrite later write 2) Control Dependence (a.k.a., Procedural Dependence) Branch condition and target address must be known before future instructions can be executed

Instruction Dependencies (2) From Quicksort: RAW WAW WAR Control for (; (j < high) && (array[j] < array[low]); ++j); L 1 : L 2 : bge j, high, L 2 mul $15, j, 4 addu $24, array, $15 lw $25, 0($24) mul $13, low, 4 addu $14, array, $13 lw $15, 0($14) bge $25, $15, L 2 addu j, j, 1... addu $11, $11, -1... Real code has lots of dependencies

Hardware Dependency Analysis Pipeline must handle Register Data Dependencies (same register) RAW, WAW, WAR Memory Data Dependencies (same/overlapping locations) RAW, WAW, WAR Control Dependencies

Pipeline: Steady State t 0 t 1 t 2 t 3 t 4 t 5 Inst i Inst i+1 Inst i+2 Inst i+3 Inst i+4 IF ID RD ALU MEM IF ID RD ALU IF ID RD IF ID IF

Data Hazards Caused by data dependencies between instruction Necessary conditions in linear pipeline WAR: write stage earlier than read stage Is this possible in our pipeline? WAW: write stage earlier than write stage Is this possible in our pipeline? RAW: read stage earlier than write stage Is this possible in our pipeline? If conditions not met, hazards won t happen Check pipeline for both register and memory

Problem: Data Hazard Inst i Inst i+1 Inst i+2 Inst i+3 Inst i+4 t 0 t 1 t 2 t 3 t 4 t 5 IF ID RD ALU MEM IF ID RD ALU Only RAW is possible in our case and only for registers (not memory) IF ID RD IF ID IF

How to Detect Data Hazard (1) Compare read-register specifiers for newer instructions with write-register specifiers for older instructions E.g., in this 6-stage pipeline, to detect if there is a RAW dependence between inst in RD stage and an older inst: 1a. ID/RD.RegisterRs == RD/ALU.RegisterRd 1b. ID/RD.RegisterRt == RD/ALU.RegisterRd 2a. ID/RD.RegisterRs == ALU/MEM.RegisterRd 2b. ID/RD.RegisterRt == ALU/MEM.RegisterRd 3a. ID/RD.RegisterRs == MEM/WB.RegisterRd 3b. ID/RD.RegisterRt == MEM/WB.RegisterRd Dependency to inst in ALU stage Dependency to inst in MEM stage Dependency to inst in WB stage Should also check that the older instruction is going to write to the register. E.g., in case 1, should also check for RD/ALU.RegWrite && (RD/ALU.RegisterRd!= 0)

How to Detect Data Hazard (2) If there are multiple dependences with older instructions, determine the youngest of the older instruction with which we have a dependency That s the dependency we should resolve In the previous example, inst in ALU is thr youngest of older instructions, so case 1 takes precedence over others

Solution 1: Stall on Data Hazard (1) t 0 t 1 t 2 t 3 t 4 t 5 Inst i Inst i+1 Inst i+2 Inst i+3 Inst i+4 IF ID Stalled in RD RD ALU MEM WB IF Stalled in ID ID RD ALU MEM WB Stalled in IF IF ID RD ALU MEM IF ID RD ALU IF ID RD IF ID Dependent instruction moves to RD, and stays there until dependency is resolved E.g., if inst i+2 depends on inst i+1, inst i+2 has to stall for 3 cycles So do instructions following inst i+2 IF

Solution 1: Stall on Data Hazard (2) Instructions in IF, ID and RD stay ID/RD and IF/ID pipeline registers not updated For stages after RD, send no-op down pipeline (called a bubble) bubble: state of pipeline registers that would correspond to a no-op instruction occupying that stage

Solution 2: Forwarding Paths (1) t 0 t 1 t 2 t 3 t 4 t 5 Inst i Inst i+1 Inst i+2 Inst i+3 Inst i+4 IF ID RD ALU MEM Idea: avoid stalling by forwarding older inst results to younger ones before they are written to RF.

Solution 2: Forwarding Paths (2) src1 IF ID src2 dest Register File ALU MEM WB

Solution 2: Forwarding Paths (3) IF ID src1 src2 dest Register File = = Deeper pipelines in general require additional forwarding paths = = = = ALU MEM WB

Solution 2: Forwarding Paths (4) t 0 t 1 t 2 t 3 t 4 t 5 Inst i Inst i+1 Inst i+2 Inst i+3 Inst i+4 IF ID RD ALU MEM Sometimes, forwarding is not enough and some stalling is needed E.g., if inst i+2 depends on inst i+1, and inst i+1 is a load, inst i+2 has to be stalled for at least one cycle until inst i+1 accesses the data cache Then, we can forward the result to inst i+2

Problem: Control Hazard t 0 t 1 t 2 t 3 t 4 t 5 Inst i Inst i+1 Inst i+2 Inst i+3 Inst i+4 IF ID RD ALU MEM Assume inst i+1 is a branch We won t know the address of inst i+2 until inst i+1 (branch instruction) writes to PC Assume the branch outcome and target is calculated at the ALU stage, but is written back to PC during the MEM stage Similar to our 5-stage MIPS pipeline

Solution 1: Stall on Control Hazard Inst i Inst i+1 Inst i+2 Inst i+3 Inst i+4 t 0 t 1 t 2 t 3 t 4 t 5 Stop fetching until branch outcome is known Send no-ops down the pipe Easy to implement Stalled in IF Requires simple pre-decoding in IF to know if inst i+1 is a branch Performs poorly On out of ~6 instructions are branches Each branch takes 4 cycles to resolve CPI = 1 + 4 x 1/6 = 1.67 (best case (lower bound)) IF ID RD ALU MEM IF ID RD ALU IF ID RD IF ID IF

Solution 1: Stall on Control Hazard Inst i Inst i+1 Inst i+2 Inst i+3 Inst i+4 t 0 t 1 t 2 t 3 t 4 t 5 Stalled in IF Stop fetching until branch outcome is known IF ID RD ALU MEM IF ID RD ALU Easy to implement Requires simple pre-decoding in IF to know if insti+1 is a branch Send no-ops down the pipe Performs poorly 1 out of ~6 instructions are branches Each branch takes 4 cycles to resolve CPI = 1 + 4 x 1/6 = 1.67 (best case (lower bound)) IF ID RD IF ID IF

Solution 2: Prediction for Control Hazards Inst i Inst i+1 Inst i+2 Inst i+3 Inst i+4 New Inst i+2 New Inst i+3 New Inst i+4 t 0 t 1 t 2 t 3 t 4 t 5 IF ID RD ALU nop nop nop Fetch Resteered IF ID nop RD nop nop Predict branch not taken Send sequential instructions down pipeline IF nop ID nop nop IF ID RD IF ID IF Speculative State Cleared nop nop ALU RD ID nop nop ALU We would know the branch outcome the end of ALU If incorrect prediction, kill speculative instructions (turn them into no-ops by setting pipeline registers) Fetch from branch target Important: Speculative instructions cannot perform memory and RF writes No problem in this pipeline Because MEM and WB stages of speculative instructions come after ALU stage of branch RD

Solution 3: Delay Slots for Control Hazards Another option: delayed branches # of delay slots (ds) : less-than-or-equal-to # stages between IF and where the branch is resolved 3 (IF to ALU) in our example Always execute following ds instructions regardless of branch outcome Compiler should put useful instruction there, otherwise no-op insts Has lost popularity but lingers for compatibility reasons Just a stopgap (one cycle, one instruction) In superscalar processors, delay slot just gets in the way Legacy from old RISC ISAs

Control signals/imm Control signals Control signals Hazards & Backward-Going Lines in Pipeline M U X PC 4 + Inst Cache PC+4 instruction rega regb data dest Register File PC+4 vala valb M U X + A L U target eq? ALU result valb Data Cache ALU result mdata M U X data dest M U X IF/ID ID/EX EX/Mem Mem/WB In a linear pipeline, all structural, data and control hazards manifest as backward-going lines in the pipeline design You can use them to double-check your identification of possible control hazards in your pipeline