Pipeline Issues. This pipeline stuff makes my head hurt! Maybe it s that dumb hat. L17 Pipeline Issues Fall /5/02

Size: px

Start display at page:

Download "Pipeline Issues. This pipeline stuff makes my head hurt! Maybe it s that dumb hat. L17 Pipeline Issues Fall /5/02"

Arnold Lamb
6 years ago
Views:

1 Pipeline Issues This pipeline stuff makes my head hurt! Maybe it s that dumb hat L17 Pipeline Issues 1

2 Recalling Data Hazards PROLEM: Subsequent instructions can reference the contents of a register well before the pipeline stage where the register is written. DD(r1, r2, r3) CMPLEC(r3, 100, r0) MULC(r3, 100, r4) SU(r0, r4, r5) IF RF LU W SOLUTION #1: see. i i+1 i+2 i+3 i+4 i+5 i+6 DD CMP MUL SU DD CMP MUL SU DD CMP MUL SU DD CMP MUL SU Deal with it in SOFTWRE; expose the pipeline for all to SOLUTION #2: dd special hardware to maintain the sequential execution semantics of the IS. L17 Pipeline Issues 2

3 IF IR RF RF SU(r0,r4,r5) LU IR LU MULC(r3,100,r4) W IR W CMPLEC(r3,100,r0) W ypass Paths LU R1 RD1 ut there are some problems that YPSSING CN T FIX! Y Y W WD WE R2 RD2 ypass muxes dd special cheat paths, called YPSSES, that route the results of the LU and W stages to the RF stage, thus substituting the register s old contents with a value that will be written to that register at some point in the future. Detection of these cases has to be incorporated into the decoding logic of the RF stage, which basically looks at the instructions in the LU and W stage to see if their destination register matches a source register reference. L17 Pipeline Issues 3

4 Load Hazards Consider LODS: Can we fix all these problems using bypass paths? LD(r1, 0, r4) DD(r4, r1, r5) XOR(r3, r4, r6) IF RF LU W i i+1 i+2 i+3 i+4 i+5 i+6 LD DD XOR LD DD XOR LD DD XOR LD DD XOR The hazard between the XOR and the LD can be addressed by our established bypass paths L17 Pipeline Issues 4

5 PCSEL ILL Xdr OP 4 3 JT PC IF 00 Load Hazard (easy) +4 PC RF D IR RF Fetch Ra <20:16> Rb: <15:11> Rc <25:21> R2SEL XOR(r3, r4, r6) <PC>+C + C: <15:0> << 2 sign-extended Z R1 W RD1 JT C: <15:0> sign-extended R2 RD2 LD(r1, 0, r4) PC LU PC MEM IR LU IR MEM SEL 1 0 LUFN LU Y Y MEM 1 0 SEL dr D LU D MEM WD R/W Data RD LU The XOR operand r4 can simply be bypassed from the output of the memory in the W stage to the RF stage by our normal bypass path. (N: SME RF S OVE!) XP Rc <25:21> WSEL 0 1 W WD WE WDSEL WERF Write ack L17 Pipeline Issues 5

6 Structural Data Hazard The XOR hazard is pretty easy, but LD(r1, 0, r4) DD(r1, r4, r5) XOR(r3, r4, r6) IF RF LU W i i+1 i+2 i+3 i+4 i+5 i+6 LD DD XOR LD DD XOR??? LD DD XOR LD DD XOR How do we fix this one? In a 4-stage pipeline, for a LD instruction fetched during clock i, the data from memory isn t returned from memory until late into cycle i+3. ypassing can fix the XOR but not DD! L17 Pipeline Issues 6

7 PCSEL ILL Xdr OP 4 3 JT PC IF 00 Load Hazard (hard) +4 PC RF D IR RF Fetch Ra <20:16> Rb: <15:11> Rc <25:21> R2SEL DD(r4, r1, r5) LD(r1, 0, r4) PC LU <PC>+C + C: <15:0> << 2 sign-extended IR LU Z SEL 1 0 LUFN R1 W RD1 JT C: <15:0> sign-extended LU Y 1 0 SEL R2 RD2 D LU LU The r4 operand to the DD instruction hasn t yet been fetched from memory. PC MEM IR MEM Y MEM dr D MEM WD R/W Data RD It exists NOWHERE on our data paths we can t solve this problem by bypassing! (N: SME RF S OVE!) XP Rc <25:21> WSEL 0 1 W WD WE WDSEL WERF Write ack L17 Pipeline Issues 7

8 Load Delay ypassing can t fix the problem with DD since the data simply isn t available! We have to add some pipeline interlock hardware to stall DD s execution. LD(r1, 0, r4) DD(r1, r4, r5) XOR(r3, r4, r6) IF RF LU W i i+1 i+2 i+3 i+4 i+5 i+6 LD DD XOR XOR LD DD DD XOR LD DD XOR LD DD XOR If the compiler knows about a machine s load delay, it can often rearrange code sequences to eliminate such hazards. Many compilers provide machine-specific instruction scheduling. L17 Pipeline Issues 8

9 Timing & Pipelining ut, but, what about FSTER processors? FCT: Processors have become very fast relative to memories! nd this gap continues to grow Do we just increase the clock period to accommodate this bottleneck component? LTERNTIVE: Longer pipelines. 1. dd MEMORY WIT stages between STRT of read operation & return of data. 2. uild pipelined memories, so that multiple (say, N) memory transactions can be in progress at once. These steps add load delay slots; hence 3. Stall pipeline on unbypassable load delays. 4-Stage pipeline requires RED access in less than one clock. 5-Stage pipeline would allow nearly two clocks... L17 Pipeline Issues 9

10 Xdr ILL OP JT PCSEL PC IF PC RF D IR RF Omits some detail NO bypass or interlock logic Fetch 5-stage Pipeline Ra <20:16> Rb: <15:11> Rc <25:21> R2SEL + C: <15:0> << 2 sign-extended <PC RF >+4+C*4 Z R1 W RD1 JT C: <15:0> sign-extended R2 RD2 SEL SEL PC LU IR LU D LU LUFN LU Y LU PC MEM IR MEM Y MEM dr D MEM WD R/W ddress available right after instruction enters pipe stage PC W IR W Y W Data almost 2 clock cycles WSEL Rc <25:21> 0 1 W XP WD WE WDSEL WERF RD Write ack Data needed right before rising clock edge at end of Write ack pipe stage L17 Pipeline Issues 10

11 Xdr ILL OP JT PCSEL PC IF PC RF D IR RF Omits some detail NO bypass or interlock logic Fetch 5-stage pipeline Ra <20:16> Rb: <15:11> Rc <25:21> PC LU + C: <15:0> << 2 sign-extended <PC RF >+4+C*4 IR LU Z SEL 1 0 LUFN R1 W RD1 JT C: <15:0> sign-extended LU Y 1 R2 RD2 0 SEL R2SEL D LU LU We wanted a simple, clean pipeline but added IR IF mux to annul branch-slot instructions PC MEM PC W IR MEM IR W WSEL Rc <25:21> 0 1 W XP Y MEM Y W WD WE WDSEL WERF dr D MEM WD Data RD R/W Write ack added / bypass muxes to get data before it s written to regfile added LE/muxes to freeze IF/RF stage so we can wait for LD to reach W stage L17 Pipeline Issues 11

12 RF-stage ypass Details 1, 2, 3, 4 Hum, it looks like we have a few extra inputs on that mux from LU/MEM/W RD1 RD2 from LU/MEM/W Note: Note: can can use use distributed distributed mux mux built built from from tristate tristate drivers. drivers. ypass ypass PC+4+4*SXT(C) JT Z SXT(C) 1 0 SEL 1 0 SEL D To LU We ve been a little sloppy about this detail To LU To Mem L17 Pipeline Issues 12

13 ypass Implementation from mem Part 2 of the WDSEL mux Part 1 of the WDSEL mux to regs ypass From W Y MEM ypass From MEM Y LU ypass From LU alu PC LU from regs to alu Select 0 To reduce the amount of bypass logic, the WDSEL mux has been split: choice between LU and PC+4 is made in LU stage, choice between LU/PC and MEMDT is made is W stage. L17 Pipeline Issues 13

14 ypass Logic eta ypass logic (need two copies for / data): Ra or Rb/Rc Regfile (no bypass) Rc (W)* W bypass Rc (MEM)* MEM bypass Rc (LU)* LU bypass 31 5-bit compare Select 0 * If instruction is a ST (doesn t write into regfile), set RC for LU/MEM/W to R31 L17 Pipeline Issues 14

15 Linkage Write Timing The code: ssume Reg[LP] = DD(r31, r31, LP) R(f, LP) Reg[LP] =.+4 x: SU(LP, 1, LP)... f: XOR(LP, r31, r0) OR(r31, LP, r1) DD(r31, LP, r2) i i+1 i+2 i+3 i+4 i+5 i+6 IF RF LU MEM W DD R SU XOR OR DD DD R XOR OR DD DD R XOR OR DD R XOR DD R R Decision Time DD writes R writes Can Can we we make XOR s regfile access work work by by bypassing? L17 Pipeline Issues 15

16 PCSEL STLL Xdr 4 ILL OP 3 JT PC SEL 00 R/JMP STLL +4 PC RF XOR(r28 ) PC LU PC RF +4+4*SXT(C) STLL + nnul RF, ypass D IR RF nnul IF IR LU YPSSES PC RF +4+4*SXT(C) LUFN Ra:<20:16> R1 RD1 1 0 SEL LU Rb:<15:11> JT SXT(C) Y Z R2 RD2 Rc:<25:21> 1 0 SEL, YPSS 0 1 R2SEL YPSSES D LU Fetch LU PC bypass For R/JMP, Rc value is taken from PC, not LU. So we have to add bypass paths for PC LU and PC MEM. PC MEM R(,r28) PC W, ypass IR MEM IR W Y MEM Y W, YPSS dr D MEM WD Data RD R/W PC W is already taken care of if we bypass W stage from output of WDSEL mux WDSEL Rc:<25:21>, YPSS W WD WE WERF Write ack L17 Pipeline Issues 16

17 Unused Opcode Traps IDE: TRP illegal instructions to a special routine in the Operating System, which can Interpret them in software; or Print humane error report. IMPLEMENTTION: On ad Opcode (discovered in RF Stage): Select IllOp adr as next PC nnul instruction in IF stage Substitute NE(r31,0,XP) for bad instruction - will (eventually) store PC+4 into XP... need bypass paths to make XP usable immediately by code at IllOp! User program:... D(...) Illegal instr. r:... Operating System: UUO Handler IllOp: ST(r0,...) Save a reg, LD(xp,-4,r0) Fetch bad instr... LD(...,r0) Restore regs, JMP(xp) Return to pgm. L17 Pipeline Issues 17

18 Xdr ILL OP JT PCSEL STLL Illegal Opcode Traps PC SEL 00??? STLL +4 PC RF STLL D 0 1 nnul IF IR RF Fetch D NE(,XP) PC LU PC RF +4+4*SXT(C) + NE(R31,0,XP) nnul RF, YPSS IR LU YPSSES PC RF +4+4*SXT(C) LUFN Ra:<20:16> R1 RD1 1 0 SEL LU Rb:<15:11> JT SXT(C) Y Z R2 RD2 Rc:<25:21> 1 0 SEL, YPSS 0 1 R2SEL YPSSES D LU LU ad opcode decoded in RF stage: PC address of IllOp handler nnul instruction in IF PC MEM PC W, YPSS IR MEM IR W Y MEM Y W, YPSS dr D MEM WD Data R/W Force NE(R31,0,XP) in RF stage will save PC+4 in XP when it reaches W stage RD WDSEL Rc:<25:21>, YPSS W WD WE WERF Write ack L17 Pipeline Issues 18

19 Taking Exception In general, we d like to annul LL instructions following one that causes a trap or fault: FREEZE state at time of exception, for inspection by handler code. ILLEGL INSTRUCTIONS are recognizable in RF stage of pipe; are LL faults & traps? CONSIDER: RITHMETIC EXCEPTIONS: divide by zero, etc. Caught by LU subsystem, during processing of data in LU stage MEMORY FULTS: Program reference to illegal memory location... Caught by MEMORY subsystem, during processing of address input in MEM stage L17 Pipeline Issues 19

20 PCSEL STLL Xdr 4 ILL OP 3 JT PC SEL 00 nnulment Logic +4 D STLL PC RF STLL 0 1 nnul IF IR RF Fetch Fault in??? stage: PC LU PC RF +4+4*SXT(C) + NE(R31,0,XP) nnul RF, YPSS NE(R31,0,XP) nnul LU IR LU 1 0 YPSSES PC RF +4+4*SXT(C) LUFN Ra:<20:16> R1 RD1 1 0 SEL LU Rb:<15:11> JT SXT(C) Y Z R2 RD2 Rc:<25:21> 1 0 SEL, YPSS 0 1 R2SEL YPSSES D LU LU PC address of fault handler Force NE(R31,0,XP) in??? stage will save PC+4 in XP when it reaches W stage MEM FULT PC MEM PC W, YPSS NE(R31,0,XP) nnul MEM 2 IR MEM 1 0 IR W Y MEM Y W, YPSS dr D MEM WD Data R/W nnul all following instructions (those earlier in the pipeline): called flushing the pipe RD WDSEL Rc:<25:21>, YPSS W WD WE WERF Write ack L17 Pipeline Issues 20

21 synchronous I/O Interrupts This should be easy. Take, for example, The interrupted code:... DD(...) Interrupt SU(...) Taken MUL(...) HERE XOR(...)... The interrupt handler: xh: OR(...)... JMP(xp) Suppose key struck, interrupt requested (via IRQ) during the fetch of DD. Then let s Select Xdr (handler) as next PC Leave DD in pipeline; NO annulment! Code handler to return to SU instruction. Can this work??? Let s find out L17 Pipeline Issues 21

22 synchronous Interrupt Timing IF RF LU MEM W xh:... DD(...) SU(...) MUL(...) XOR(...)... Interrupt Taken HERE OR(...) interrupt handler... JMP(xp) Interrupt Occurs PC = Xadr?? i i+1 i+2 i+3 i+4 i+5 i+6 DD OR DD OR... DD OR... DD OR... DD OR... PROLEM: When does old PC+4 get written to XP??? L17 Pipeline Issues 22

23 Making Interrupts Work lternative: When taking interrupt, NNUL instruction in IF stage... UT instead of changing it to a, change it to NE(r31,0,XP) This will cause PC+4 of annulled instruction to be written to XP! CODE HNDLER to return to Reg[XP]-4 (since the annulled instruction is never executed) IF RF LU MEM W i i+1 i+2 i+3 i+4 i+5 i+6 DD OR NE OR... NE OR... NE OR... NE OR Interrupt taken L17 Pipeline Issues 23

24 Smart Interrupt Handler The interrupted code:... Interrupt taken HERE, DD(...) DD instruction annulled SU(...) MUL(...) XOR(...)... The interrupt handler: xh: OR(...)... SUC(xp,4,xp) djust XP so we JMP(xp) return to annulled instruction (DD) L17 Pipeline Issues 24

25 PCSEL STLL STLL Xdr 4 ILL OP 3 JT 2 PC SEL PC RF PC LU PC MEM PC W 1 0 PC RF +4+4*SXT(C), YPSS, YPSS STLL + NE(R31,0,XP) nnul RF NE(R31,0,XP) nnul LU NE(R31,0,XP) nnul MEM D nnul IF PC RF +4+4*SXT(C) Rc:<25:21> Ra:<20:16> Rb:<15:11> Rc:<25:21> W WD WE WERF IR RF IR LU IR MEM IR W YPSSES NE(R31,0,XP) LUFN R1 RD1 1 0 SEL JT SXT(C) LU Y Y MEM Y W Z R2 RD2 1 0 SEL, YPSS, YPSS 0 1 WDSEL, YPSS R2SEL YPSSES dr 5-stage Pipeline: D LU D MEM WD Fetch LU Data RD R/W Write ack Final Version Can annul instruction at each stage Can force instruction to NE(R31,0,XP) in each stage will save PC+4 in XP when it reaches W stage Can stall IF and RF stages while waiting for LD result to reach W Can bypass results from LU, MEM and W back to RF L17 Pipeline Issues 25

26 Pipeline Review Simple unpipelined eta: 1 cycle/instruction long cycle time: mem+regs+alu+mem 2-Stage pipeline: increased throughput (<2x) introduced branch delay slots Choice of executing or annulling inst. after branch 5-stage pipeline: increased throughput (3x???) branch delay slots delayed register writeback (3 cycles) dd bypass paths (10) to access correct value memory data available only in W stage Introduce s at IR LU, stall IF and RF stages until LD result ready handle RF/LU/MEM stage exceptions Save PC+4 in XP (fake a R) annul following insts. (those earlier in pipeline) implement interrupts Throw away IF inst., save PC+4 in XP, fix return extra HW due to pipelining s to hold values between stages Data bypass muxes in RF stage Inst. muxes for rewriting code to annul or save PC L17 Pipeline Issues 26

27 RISC = Simplicity??? The P.T. arnum World s Tallest Dwarf Competition World s Most Complex RISC? VLIWs,? Super-Scalars ddressing features, eg index registers Pipelines, ypasses, nnulment,,... Primitive Machines: direct implementations Generalization of registers and operand coding RISCs Complex instructions, addressing modes L17 Pipeline Issues 27

CPU Pipelining Issues

Spring 25 3/24 Lecture page 1 CPU Pipelining Issues What have you been beating your head against? This pipe stuff makes my head hurt! L17 Pipeline Issues 1 :J: T REG IRREG 4-Stage minimips