Lecture 4 - Pipelining

Size: px

Start display at page:

Download "Lecture 4 - Pipelining"

Abel Richard
5 years ago
Views:

1 CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining John Wawrzynek Electrical Engineering and Computer Sciences University of California at Berkeley

2 Last time in Lecture 3 Microcoding became less attractive as gap beten RAM and ROM speeds reduced Complex instruction sets difficult to pipeline, so difficult to increase performance as gate count grew Load-Store RISC ISAs designed for efficient pipelined implementations Very similar to vertical microcode Inspired by earlier Cray machines (more on these later) Iron Law explains architecture design space Trade instructions/program, cycles/instruction, and time/cycle 2

3 An Ideal Pipeline a very porful idea! stage 1 stage 2 stage 3 stage 4 All objects go through the same stages No sharing of resources beten any two stages Propagation delay through all pipeline stages is equal The scheduling of an object entering the pipeline is not affected by the objects in other stages These conditions generally hold for industrial assembly lines, but instructions depend on each other! 3

4 To pipeline RISC-V: Pipelined RISC-V First build RISC-V without pipelining with CPI=1 Next, add pipeline registers to reduce cycle time while maintaining CPI=1 4

5 Lecture 3: Unpipelined Datapath for RISC-V PCSel br rind jabs pc+4 RegWriteEn MemWrite WBSel 0x4 Add Add clk PC clk inst Inst. 1 rs1 rs2 rd1 wa wd rd2 GPRs Imm Select ALU Control Br Logic Bcomp? ALU clk rdata Data wdata OpCode WASel ImmSel FuncSel Op2Sel 5

6 Lecture 3: Hardwired Control Table Opcode ImmSel Op2Sel FuncSel MemWr RFWen WBSel WASel PCSel ALU ALUi LW SW BEQ true BEQ false J JAL JALR * Reg Func no yes ALU rd BrType 12 * * no no * * BrType 12 * * no no * * * * * no no * * pc+4 IType 12 Imm Op no yes ALU rd pc+4 IType 12 Imm + no yes Mem rd pc+4 BsType 12 Imm + yes no * * pc+4 pc+4 jabs * * * no yes PC X1 jabs * * * no yes PC rd rind br Op2Sel= Reg / Imm WASel = rd / X1 WBSel = ALU / Mem / PC PCSel = pc+4 / br / rind / jabs 6

7 Pipelined Datapath PC 0x4 Add rdata Inst. rs1 rs2 rd1 wa wd rd2 GPRs Imm Select ALU rdata Data wdata fetch phase decode & Reg-fetch phase Clock period can be reduced by dividing the execution of an instruction into multiple cycles t C > max {t IM, t RF, t ALU, t DM, t RW } ( = t DM probably) Hover, CPI will increase unless instructions are overlapped execute phase memory phase write -back phase 7

8 Iron Law of Processor Performance Time = Instructions Cycles Time Program Program * Instruction * Cycle Instructions per program depends on source code, compiler technology, and ISA Cycles per instructions (CPI) depends on ISA and µarchitecture Time per cycle depends upon the µarchitecture and base technology Lecture 2 Lecture 3 Lecture 4 Microarchitecture CPI cycle time Microcoded >1 short Single-cycle unpipelined 1 long Pipelined 1 short 8

9 CPI Examples Microcoded machine 7 cycles 5 cycles 10 cycles Inst 1 Inst 2 Inst 3 Time Unpipelined machine Pipelined machine 3 instructions, 22 cycles, CPI=7.33 Inst 1 Inst 2 Inst 3 Inst 1 Inst 2 Inst 3 3 instructions, 3 cycles, CPI=1 3 instructions, 3 cycles, CPI=1 5-stage pipeline CPI 5!!! 9

10 Technology Assumptions A small amount of very fast memory (caches) backed up by a large, slor memory Fast ALU (at least for integers) Multiported Register files Thus, the following timing assumption is reasonable t IM t RF t ALU t DM t RW A 5-stage pipeline will be focus of our detailed design - some commercial designs have over 30 stage integer pipeline! 10

11 5-Stage Pipelined Execution PC 0x4 Add rdata Inst. I-Fetch (IF) rs1 rs2 rd1 wa wdrd2 GPRs Imm Select Decode, Reg. Fetch (ID) Execute (EX) rdata Data wdata (MA) time t0 t1 t2 t3 t4 t5 t6 t7.... instruction1 IF 1 ID 1 EX 1 MA 1 WB 1 instruction2 IF 2 ID 2 EX 2 MA 2 WB 2 instruction3 IF 3 ID 3 EX 3 MA 3 WB 3 instruction4 IF 4 ID 4 EX 4 MA 4 WB 4 instruction5 IF 5 ID 5 EX 5 MA 5 WB 5 ALU Write -Back (WB) 11

12 PC 0x4 Add rdata Inst. I-Fetch (IF) 5-Stage Pipelined Execution Resource Usage Diagram rs1 rs2 rd1 ws wdrd2 GPRs Imm Select Decode, Reg. Fetch (ID) ALU Execute (EX) rdata Data wdata (MA) Write -Back (WB) Resources time t0 t1 t2 t3 t4 t5 t6 t7.... IF I 1 I 2 I 3 I 4 I 5 ID I 1 I 2 I 3 I 4 I 5 EX I 1 I 2 I 3 I 4 I 5 MA I 1 I 2 I 3 I 4 I 5 WB I 1 I 2 I 3 I 4 I 5 12

13 Control for Pipelined Execution: ALU Instructions 0x4 Add 1 PC inst Inst rs1 rs2 rd1 wa wd rd2 GPRs Imm Select A B ALU Y rdata Data wdata R MD1 MD2 Not quite correct! We need an Instruction Reg () for each stage 13

14 PC Pipelined RISC-V Datapath without jumps F D E M W 0x4 Add inst Inst rs1 rs2 rd1 wa wd rd2 GPRs Imm Select RegWriteEn A B FuncSel ALU Y WASel MemWrite rdata Data wdata 1 WBSel R ImmSel Op2Sel MD1 MD2 Control Signals Need to Be Delayed 14

15 Instructions interact with each other in pipeline An instruction in the pipeline may need a resource being used by another instruction in the pipeline à structural hazard An instruction may depend on something produced by an earlier instruction Dependence may be for a data value à data hazard Dependence may be for the next instruction s ess à control hazard (branches, exceptions) 15

16 Resolving Structural Hazards Structural hazard occurs when two instructions need same hardware resource at same time Can resolve in hardware by stalling ner instruction till older instruction finished with resource A structural hazard can always be avoided by adding more hardware to design E.g., if two instructions both need a port to memory at same time, could avoid hazard by adding second port to memory Our 5-stage pipeline has no structural hazards by design Thanks to RISC-V ISA, which was designed for pipelining E.g., included a second add for branch ess calculations. 16

17 Data Hazards x4 x1 x1 0x4 Add 1 PC inst Inst rs1 rs2 rd1 wa wd rd2 GPRs Imm Select A B ALU Y rdata Data wdata R MD1 MD2... x1 x x4 x x1 is stale. Oops! 17

18 Resolving Data Hazards (1) Strategy 1: Wait for the result to be available by freezing earlier pipeline stages è interlocks 18

19 Feedback to Resolve Hazards FB 1 FB 2 FB 3 FB 4 stage 1 stage 2 stage 3 stage 4 Later stages provide information to earlier stages which might result in dependencies and need to stall instructions Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any influence from instructions in stages 1 to i (otherwise deadlocks may occur) 19

20 Interlocks to resolve Data Hazards Stall Condition 0x4 Add bubble 1 PC inst Inst... x1 x x4 x rs1 rs2 rd1 wa wd rd2 GPRs Imm Select A B MD1 ALU Y MD2 rdata Data wdata R 20

21 Stalled Stages and Pipeline Bubbles time t0 t1 t2 t3 t4 t5 t6 t7.... (I 1 ) x1 (x0) + 10 IF 1 ID 1 EX 1 MA 1 WB 1 (I 2 ) x4 (x1) + 17 IF 2 ID 2 ID 2 ID 2 ID 2 EX 2 MA 2 WB 2 (I 3 ) IF 3 IF 3 IF 3 IF 3 ID 3 EX 3 MA 3 WB 3 (I 4 ) (I 5 ) stalled stages IF 4 ID 4 IF 5 EX 4 ID 5 MA 4 WB 4 EX 5 MA 5 WB 5 Resource Usage time t0 t1 t2 t3 t4 t5 t6 t7.... IF I 1 I 2 I 3 I 3 I 3 I 3 I 4 I 5 ID I 1 I 2 I 2 I 2 I 2 I 3 I 4 I 5 EX I I 2 I 3 I 4 I 5 MA I I 2 I 3 I 4 I 5 WB I I 2 I 3 I 4 I 5 - pipeline bubble 21

22 Interlock Control Logic stall C stall wa rs2 rs1? 0x4 Add bubble 1 PC inst Inst rs1 rs2 rd1 wa wd rd2 GPRs Imm Select A B ALU Y rdata Data wdata R Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions. MD1 MD2 22

23 Interlock Control Logic ignoring jumps & branches stall wa C stall rs1 rs2? re1 re2 C dest wa wa C dest 0x4 Add C re bubble 1 PC inst Inst rs1 rs2 rd1 wa wd rd2 GPRs Imm Select Should always stall if an rs field matches some rd? not every instruction writes a register not every instruction reads a register re A B MD1 ALU Y MD2 rdata Data wdata C dest R 23

24 Source & Destination Registers rd rs1 rs2 func10 opcode ALU rd rs1 Imm[11:0] func3 opcode ALUI/LW/JALR Imm[11:7] rs1 rs2 Imm[6:0] func3 opcode SW/Bcond Jump offset[24:0] opcode source(s) destination ALU rd rs1 func10 rs2 rs1, rs2 rd ALUI rd rs1 op imm rs1 rd LW rd M [rs1 + imm] rs1 rd SW M [rs1 + imm] rs2 rs1, rs2 - Bcond rs1,rs2 rs1, rs2 - true: PC PC + imm false: PC PC + 4 J PC PC + imm - - JAL x1 PC, PC PC + imm - x1 JALR rd PC, PC rs1 + imm rs1 rd 24

25 Deriving the Stall Signal C dest ws = Case opcode JAL else X1 rd = Case opcode ALU, ALUi, LW,JALR (ws x0) JAL C re re1 = Case opcode ALU, ALUi, LW, SW, Bcond, JALR 1 J, JAL 0 re2 = Case opcode ALU, SW,Bcond C stall stall = ((rs1 D =ws E ) E + (rs1 D =ws M ) M + (rs1 D =ws W ) W ) re1 D + ((rs2 D =ws E ) E + (rs2 D =ws M ) M + (rs2 D =ws W ) W ) re2 D 25

26 Hazards due to Loads & Stores Stall Condition What if x1+7 = x3+5? 0x4 Add bubble 1 PC inst Inst... M[x1+7] x2 x4 M[x3+5]... rs1 rs2 rd1 wa wd rd2 GPRs Imm Select A B MD1 ALU Y MD2 rdata Data wdata Is there any possible data hazard in this instruction sequence? R 26

27 Load & Store Hazards... M[x1+7] x2 x4 M[x3+5]... If x1+7 = x3+5 data hazard Hover, the hazard is avoided because our memory system completes writes in one cycle! Load/Store hazards are sometimes resolved in the pipeline and sometimes in the memory system itself. More on this later in the course. 27

28 CS152 Administrivia PS1 now due Thursday next ek instead of Tuesday. Quiz 1 on Tue Sep 20 will cover PS1, Lab1, lectures 1-5, and associated readings. 28

29 Resolving Data Hazards (2) Strategy 2: Route data as soon as possible after it is calculated to the earlier pipeline stage à bypass 29

30 Bypassing time t0 t1 t2 t3 t4 t5 t6 t7.... (I 1 ) x1 x IF 1 ID 1 EX 1 MA 1 WB 1 (I 2 ) x4 x IF 2 ID 2 ID 2 ID 2 ID 2 EX 2 MA 2 WB 2 (I 3 ) IF 3 IF 3 IF 3 IF 3 ID 3 EX 3 MA 3 (I 4 ) stalled stages IF 4 ID 4 EX 4 (I 5 ) IF 5 ID 5 Each stall or kill introduces a bubble in the pipeline CPI > 1 A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input time t0 t1 t2 t3 t4 t5 t6 t7.... (I 1 ) x1 x IF 1 ID 1 EX 1 MA 1 WB 1 (I 2 ) x4 x IF 2 ID 2 EX 2 MA 2 WB 2 (I 3 ) IF 3 ID 3 EX 3 MA 3 WB 3 (I 4 ) IF 4 ID 4 EX 4 MA 4 WB 4 (I 5 ) IF 5 ID 5 EX 5 MA 5 WB 5 30

31 stall Adding a Bypass 0x4 Add x4 x1... bubble x1... E M W 1 PC inst Inst D rs1 rs2 rd1 wa wd rd2 GPRs Imm Select ASrc A B MD1 ALU Y MD2 rdata Data wdata R... When does this bypass help? (I 1 ) x1 x x1 M[x0 + 10] JAL 500 (I 2 ) x4 x yes x4 x no x4 x no 31

32 The Bypass Signal Deriving it from the Stall Signal stall = ( ((rs1 D =ws E ) E + (rs1 D =ws M ) M + (rs1 D =ws W ) W ) re1 D +((rs2 D =ws E ) E + (rs2 D =ws M ) M + (rs2 D =ws W ) W ) re2 D ) ws = Case opcode JAL X1 else rd ASrc = (rs1 D =ws E ) E re1 D = Case opcode ALU, ALUi, LW, JALR (ws x0) JAL Is this correct? No because with out datapath modification only ALU and ALUi instructions can benefit from this bypass! Split E into two components: -bypass, -stall 32

33 Bypass and Stall Signals Split E into two components: -bypass, -stall -bypass E = Case opcode E ALU, ALUi (ws 0)... off -stall E = Case opcode E LW, JALR (ws 0) JAL on... off ASrc = (rs1 D =ws E ) -bypass E re1 D stall = ((rs1 D =ws E ) -stall E + (rs1 D =ws M ) M + (rs1 D =ws W ) W ) re1 D + ((rs2 D = ws E ) E + (rs2 D = ws M ) M + (rs2 D = ws W ) W ) re2 D 33

34 Fully Bypassed Datapath stall PC for JAL,... 0x4 Add bubble ASrc E M W 1 PC inst Inst Is there still a need for the stall signal? D rs1 rs2 rd1 wa wd rd2 GPRs Imm Select BSrc A B MD1 ALU Y MD2 rdata Data wdata stall = (rs1 D =ws E ). (opcode E =LW E ).(ws E 0 ).re1 D + (rs2 D =ws E ). (opcode E =LW E ).(ws E 0 ).re2 D R 34

35 Time Pipeline CPI Examples Inst 1 Inst 2 Inst 3 Measure from when first instruction finishes to when last instruction in sequence finishes. 3 instructions finish in 3 cycles CPI = 3/3 =1 Inst 1 Inst 2 Bubble Inst 3 Inst 1 Bubble 1 Inst 2 Inst Bubble 3 2 Inst 3 3 instructions finish in 4 cycles CPI = 4/3 = instructions finish in 5 cycles CPI = 5/3 =

36 Acknowledgements These slides contain material developed and copyright by: Arvind (MIT) Krste Asanovic (MIT/UCB) Joel Emer (Intel/MIT) James Hoe (CMU) John Kubiatowicz (UCB) David Patterson (UCB) MIT material derived from course UCB material derived from course CS252 47

Lecture 7 Pipelining. Peng Liu.

Lecture 7 Pipelining. Peng Liu. Lecture 7 Pipelining Peng Liu liupeng@zju.edu.cn 1 Review: The Single Cycle Processor 2 Review: Given Datapath,RTL -> Control Instruction Inst Memory Adr Op Fun Rt