CPU Pipelining Issues

"What have you been beating your head against?"  "This pipe stuff makes my head hurt!"

Finishing up Chapter 6.

Structural Data Hazard

Consider LOADS: can we fix all these problems using our previous bypass paths?

    lw  $t4,0($t1)
    add $t5,$t1,$t4
    xor $t6,$t3,$t4

           i    i+1  i+2  i+3  i+4  i+5  i+6
    IF     lw   add  xor
    RF          lw   add  xor
    ALU              lw   add  xor
    WB                    lw   add  xor

Load data hazards are complicated by the fact that their result is resolved later than the ALU pipeline stage. For a lw instruction fetched during clock i, data isn't returned from memory until late in cycle i+3. Bypassing will fix the xor, but not the add!
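To make the timing concrete, here is a minimal Python sketch (not from the lecture; the stage numbering and helper names are assumptions) that compares the cycle in which each dependent instruction needs $t4 against the cycle in which the lw's data actually shows up:

    # Hypothetical 4-stage pipeline model: IF=0, RF=1, ALU=2, WB=3.
    # A lw fetched in cycle i returns its data late in cycle i+3 (its WB stage),
    # so the value can only be bypassed to instructions whose ALU stage is i+4 or later.

    def alu_cycle(fetch_cycle):
        """Cycle in which an instruction needs its source operands (its ALU stage)."""
        return fetch_cycle + 2

    def lw_data_ready(fetch_cycle):
        """Cycle at whose end the lw's memory data is available for bypassing."""
        return fetch_cycle + 3

    lw_fetch, add_fetch, xor_fetch = 0, 1, 2   # lw, add, xor fetched in i, i+1, i+2

    ready = lw_data_ready(lw_fetch)
    for name, f in [("add", add_fetch), ("xor", xor_fetch)]:
        needs = alu_cycle(f)
        ok = needs > ready   # operand must be ready before the consumer's ALU cycle
        print(f"{name}: needs $t4 in cycle {needs}, lw data ready at end of cycle {ready} "
              f"-> {'bypass works' if ok else 'bypass cannot help'}")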

Load Delays

Bypassing can't fix the problem with the add since the data simply isn't available! In order to fix it we have to add pipeline interlock hardware to stall the add's execution, or else program around it.

    lw  $t4,0($t1)
    add $t5,$t1,$t4
    xor $t6,$t3,$t4

           i    i+1  i+2  i+3  i+4  i+5  i+6
    IF     lw   add  xor  xor
    RF          lw   add  add  xor
    ALU              lw   nop  add  xor
    WB                    lw   nop  add  xor

Adding stalls to the pipeline in order to assure proper operation is sometimes called inserting pipeline BUBBLES. This requires inserting a MUX just before the instruction register of the ALU stage, IR_ALU, to annul the add (by inserting a NOP), as well as clock enables on the PC and IR pipeline registers of earlier pipeline stages to stall execution without annulling any instructions. This is how the SPIM simulator works.
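A minimal sketch of the interlock's stall decision (the dictionary encoding and field names are assumptions, not the lecture's signal names): stall, i.e. freeze PC/IR_RF and force a NOP into IR_ALU, when the instruction in the ALU stage is a lw whose destination matches a source of the instruction sitting in the RF stage.

    def needs_load_stall(rf_inst, alu_inst):
        """Each instruction is a dict, e.g. {'op': 'lw', 'dest': 12, 'srcs': [9]}."""
        if alu_inst is None or alu_inst['op'] != 'lw':
            return False
        dest = alu_inst['dest']
        return dest != 0 and dest in rf_inst.get('srcs', [])

    lw  = {'op': 'lw',  'dest': 12, 'srcs': [9]}       # lw  $t4,0($t1)
    add = {'op': 'add', 'dest': 13, 'srcs': [9, 12]}   # add $t5,$t1,$t4
    xor = {'op': 'xor', 'dest': 14, 'srcs': [11, 12]}  # xor $t6,$t3,$t4

    print(needs_load_stall(add, lw))             # True  -> insert a bubble
    print(needs_load_stall(xor, {'op': 'nop'}))  # False -> no stall needed for xor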

Punting on Load Interlock

Early versions of MIPS did not include a pipeline interlock, thus requiring the compiler/programmer to work around it.

    lw  $t4,0($t1)
    nop
    add $t5,$t1,$t4
    xor $t6,$t3,$t4

           i    i+1  i+2  i+3  i+4  i+5  i+6
    IF     lw   nop  add  xor
    RF          lw   nop  add  xor
    ALU              lw   nop  add  xor
    WB                    lw   nop  add  xor

If the compiler knows about the load delay, it can often rearrange the code sequence to eliminate the hazard. Many compilers can provide implementation-specific instruction scheduling. This requires no additional H/W, but it further confuses the instruction semantics. We'll include interlocks for SPIM compatibility.
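As a toy illustration of the compiler workaround (not a real scheduler; the IR encoding is hypothetical, and the xor's operands are changed to $t3 and $t2 so that there is an independent instruction available to hoist): look past the lw for an instruction that has no register overlap with the lw or with anything it would be hoisted over, and move it into the delay slot instead of a nop.

    def regs(inst):
        return {inst['dest'], *inst['srcs']} - {0}

    def independent(a, b):
        """Conservative: no register overlap at all -> safe to reorder a and b."""
        return not (regs(a) & regs(b))

    def fill_load_delay_slot(block):
        out, i = [], 0
        while i < len(block):
            inst = block[i]
            out.append(inst)
            i += 1
            if inst['op'] != 'lw':
                continue
            slot = {'op': 'nop', 'dest': 0, 'srcs': []}
            for j in range(i, len(block)):
                cand = block[j]
                # safe only if the candidate is independent of the lw and of every
                # instruction it would be hoisted over
                if all(independent(cand, other) for other in [inst] + block[i:j]):
                    slot, block = cand, block[:j] + block[j + 1:]
                    break
            out.append(slot)
        return out

    prog = [
        {'op': 'lw',  'dest': 12, 'srcs': [9]},       # lw  $t4,0($t1)
        {'op': 'add', 'dest': 13, 'srcs': [9, 12]},   # add $t5,$t1,$t4
        {'op': 'xor', 'dest': 14, 'srcs': [11, 10]},  # xor $t6,$t3,$t2 (independent)
    ]
    print(fill_load_delay_slot(prog))   # the xor is hoisted into the delay slot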

Load Delays (cont'd)

But, but, what about FASTER processors? FACT: Processors have become very fast relative to memories! Can we just stall the pipe longer? Add more NOPs?

ALTERNATIVE: Longer pipelines.
1. Add MEMORY WAIT stages between the START of the read operation and the return of data.
2. Build pipelined memories, so that multiple (say, N) memory transactions can be in progress at once.
3. (Optional) Stall the pipeline when the N limit is exceeded.

Sadly, the data memory is our bottleneck. If we want to go faster, we will have to surround it with pipeline stages. The 4-stage pipeline requires the READ access to finish in LESS than one clock. SOLUTION: a 5-stage pipeline that allows nearly two clocks for data memory accesses...
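A back-of-the-envelope sketch of why the extra stage helps (the stage delays below are made-up numbers, purely illustrative, not figures from the lecture):

    # The clock period is set by the slowest stage, so splitting the memory access
    # across two stages lets the data memory use almost two clock periods.

    dmem_access = 6.0    # ns needed by the data memory (made up)
    other_stage = 3.5    # ns for IF / RF / ALU / WB logic (made up)

    # 4-stage pipeline: the whole memory access must fit in one stage
    clock_4 = max(other_stage, dmem_access)          # memory limits the clock

    # 5-stage pipeline: the access is started in one stage and captured at the
    # end of the next, so it may take (almost) two clock periods
    clock_5 = max(other_stage, dmem_access / 2)

    print(f"4-stage clock: {clock_4} ns, 5-stage clock: {clock_5} ns")
    print(f"memory budget in the 5-stage pipeline: ~{2 * clock_5} ns (almost two clocks)")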

5-Stage minimips

[Datapath diagram of the 5-stage minimips (Fetch, RF, ALU, MEM, Write Back) with PC and IR pipeline registers. Omits some details; NO bypass or interlock logic yet.]

The data memory address is available right after the instruction enters the MEM stage, and the data is needed just before the rising clock edge at the end of the Write Back stage, so the data memory gets almost 2 clock cycles per access.

One More Fly in the Ointment

There is one more structural hazard that we have not discussed: the saving, and subsequent accesses, of the return address produced by the jump-and-link, jal, instruction. Moreover, given that we have bought into a single delay slot, which is always executed, we now need to store the address of the instruction FOLLOWING the delay slot instruction. We need to return to PC+8, not PC+4. Once more we need to rewrite the ISA spec!

    jal  sqr            # call procedure
    addi $a0,$0,10      # delay slot
    addi $t0,$v0,-1     # return address
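A tiny sketch of the return-address arithmetic under the delay-slot convention (the addresses are made up for illustration):

    # With a single always-executed delay slot, the return address saved by jal
    # must point past both the jal and its delay-slot instruction:
    # PC_of_jal + 8, not PC_of_jal + 4.

    pc_of_jal  = 0x0040_0100          # made-up address of the jal
    delay_slot = pc_of_jal + 4        # addi $a0,$0,10  (always executed)
    return_to  = pc_of_jal + 8        # addi $t0,$v0,-1 (first instruction after the call)

    print(hex(delay_slot), hex(return_to))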

Return Address Writes

    # The code: Assume Reg[LP] = 100...
          add  $ra,$0,$0
          jal  f
          addi $ra,$ra,4    # In delay slot...
    f:    xor  $t0,$ra,$0
          or   $t1,$0,$ra
          add  $t2,$0,$ra

           i    i+1  i+2  i+3  i+4  i+5  i+6
    IF     add  jal  addi xor  or   add
    RF          add  jal  addi xor  or   add
    ALU              add  jal  addi xor  or
    MEM                   add  jal  addi xor
    WB                         add  jal  addi

Can we make the regfile accesses of the 3 instructions following the jal work by bypassing? Where do we get the right return address from? (Diagram callouts mark the branch decision time, the cycle where the addi reads $ra, and the cycle where the jal's $ra write happens.)

JAL PC Bypasses (fetching xor at f)

[Datapath diagram with addi $31,$31,4 in the RF stage, jal f in the ALU stage, and add $ra,$0,$0 in the MEM stage; the A/B bypass muxes are highlighted.]

On JALs, the register file saves the next address from the DELAY SLOT instruction (often PC+8). Note that this bypass is routed from the PC pipeline, not from the ALU output. Thus, we need to add bypass paths for PC_MEM.

JAL PC Bypasses (fetching or at f+4)

[Datapath diagram with xor $8,$31,$0 in the RF stage, addi $31,$31,4 in the ALU stage, jal f in the MEM stage, and add $ra,$0,$0 in the WB stage.]

We need another PC bypass. In this case, the bypass path supplies the $31 operand for the XOR instruction.

JAL PC Bypasses (fetching add at f+8)

[Datapath diagram with or $1,$0,$31 in the RF stage, xor $8,$31,$0 in the ALU stage, addi $31,$31,4 in the MEM stage, and jal f in the WB stage.]

And we need another PC_MEM bypass. In this case, the bypass path supplies the $31 operand for the OR instruction. PC_WB is already taken care of, for the following ADD, using the WB-stage bypass at the output of the WDSEL mux.
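As a rough sketch of the detection side of this (the encoding and names are assumptions; the actual bypass sources are the PC pipeline registers shown in the diagrams): a read of $31 by an instruction that is still in the shadow of an in-flight jal cannot be forwarded from the ALU/Y data outputs, because until the jal reaches write-back the return address only exists in the PC pipeline.

    RA = 31

    def ra_bypass_from_pc_pipeline(reader_srcs, in_flight_ops):
        """reader_srcs: source registers of the instruction in RF.
        in_flight_ops: ops of earlier instructions not yet written back."""
        return RA in reader_srcs and 'jal' in in_flight_ops

    # xor $8,$31,$0 sits in RF while the delay-slot addi and the jal are in flight:
    print(ra_bypass_from_pc_pipeline({31, 0}, ['addi', 'jal']))  # True -> use PC pipeline
    # add $t2,$0,$ra much later, after the jal has written back:
    print(ra_bypass_from_pc_pipeline({0, 31}, []))               # False -> regfile is fine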

5-Stage minimips

[Complete datapath diagram with branch logic, A/B bypass muxes, NOP insertion, and clock enables.]

We wanted a simple, clean pipeline, but we:
- broke the sequential semantics of the ISA by adding a branch delay slot and early branch resolution logic,
- added A/B bypass muxes to get data before it's written to the regfile,
- added CLK ENs to freeze the IF/RF stages so we can wait for a lw to reach the WB stage.

Bypass MUX Details

The previous diagram was oversimplified. We really need the bypass muxes to precede the A and B muxes so they can provide the correct values for the jump target, the write data, and the early branch decision logic.

[Diagram: RD1 and RD2 from the register file feed the A and B bypass muxes (other inputs come from the ALU/MEM/WB/PC stages); their outputs drive the ASEL/BSEL muxes, the branch-decision (BZ) logic, and the A, B, and WD pipeline registers headed to the ALU and memory.]

Bypass Logic

minimips bypass logic (need two copies, one for the A data and one for the B data):

[Diagram: the reader's rs or rt field goes through a 5-bit compare against the destination register (rt or rd) of the instruction in each of the ALU, MEM, and WB stages. A match in the youngest stage wins: ALU bypass, then MEM bypass, then WB bypass; otherwise the regfile value is used (no bypass). A destination of $0 never matches.]

* If the instruction is a sw (it doesn't write into the regfile), set its rt for the ALU/MEM/WB comparison to $0.
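A minimal Python sketch of this selection for one operand (the names and encoding are assumptions; the real logic is the comparator tree in the diagram):

    # The youngest in-flight producer of the register wins: ALU, then MEM, then WB;
    # otherwise the value comes straight from the register file.  A destination of
    # $0 never matches, and a sw is treated as writing $0 since it does not write
    # the regfile.

    def write_reg(inst):
        """Destination register number of an in-flight instruction (0 if none)."""
        if inst is None or inst['op'] in ('sw', 'nop', 'beq'):
            return 0
        return inst['dest']

    def bypass_select(src_reg, alu_inst, mem_inst, wb_inst):
        if src_reg != 0:
            for name, inst in (('ALU', alu_inst), ('MEM', mem_inst), ('WB', wb_inst)):
                if write_reg(inst) == src_reg:
                    return name + ' bypass'
        return 'regfile'

    # Example: a consumer in RF reads $t4 while add $t4,... is one stage ahead (in ALU):
    producer = {'op': 'add', 'dest': 12}
    print(bypass_select(12, producer, None, None))   # 'ALU bypass'
    print(bypass_select(11, producer, None, None))   # 'regfile'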

Final 5-Stage minimips

[Complete datapath diagram with NOP-insertion muxes, early branch logic, and bypass paths.]

- Added a branch delay slot and early branch resolution logic to fix a CONTROL hazard.
- Added lots of bypass paths and detection logic to fix various STRUCTURAL hazards.
- Added pipeline interlocks to fix the load delay STRUCTURAL hazard.

Pipeline Summary (I)

Started with an unpipelined implementation:
- direct execute, 1 cycle/instruction
- it had a long cycle time: mem + regs + alu + mem + wb

We ended up with a 5-stage pipelined implementation:
- increased throughput (3x???)
- delayed branch decision (1 cycle): choose to execute the instruction after the branch
- delayed register writeback (3 cycles): add bypass paths (6 x 2 = 12) to forward the correct value
- memory data available only in the WB stage: introduce NOPs at IR_ALU to stall the IF and RF stages until the LD result was ready
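A rough illustration of where a throughput figure like "3x" could come from (the component delays below are made up, purely a sketch, not numbers from the lecture):

    # Made-up component delays in ns.
    t_imem, t_regs, t_alu, t_dmem, t_wb = 4.0, 1.5, 3.0, 4.0, 1.0

    unpipelined_cycle = t_imem + t_regs + t_alu + t_dmem + t_wb   # one long cycle/instruction
    pipelined_cycle   = max(t_imem, t_regs, t_alu, t_dmem, t_wb)  # slowest stage sets the clock

    # Ideal throughput gain, ignoring stalls, bubbles, and pipeline register overhead:
    print(unpipelined_cycle / pipelined_cycle)    # about 3.4x with these made-up numbers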

Pipeline Summary (II)

Fallacy #1: Pipelining is easy. Smart people get it wrong all of the time! Costs? Respins of the design, and forcing S/W folks to devise program/compiler workarounds.

Fallacy #2: Pipelining is independent of the ISA. Many ISA decisions impact how easy or costly it is to implement pipelining (e.g., branch semantics, addressing modes). Bad decisions impact future implementations (delay slot vs. annul? load interlocks?) and break otherwise clean semantics. For performance, S/W must be aware!

Fallacy #3: Increasing pipeline stages improves performance. Diminishing returns; increasing complexity. Can introduce unusable delay slots and long interlock stalls.

RISC = Simplicity???

The P.T. Barnum "World's Tallest Dwarf" Competition: World's Most Complex RISC?

RISC was conceived to be SIMPLE. SIMPLE -> FAST. MORE SPEED -> pipelining. Pipelining -> complexity. Complexity increases delays in worst-case paths: pipelines, bypasses, annulment, ... super-scalars, VLIWs?

[Diagram labels: primitive machines with direct implementations; addressing features, e.g., index registers; generalization of registers and operand coding; complex instructions, addressing modes; RISCs.]