CPU Pipelining Issues

Spring 25 3/24 Lecture
L17 Pipeline Issues

What have you been beating your head against? This pipe stuff makes my head hurt!

4-Stage miniMIPS pipeline hazards

[Figure: 4-stage miniMIPS datapath (IF, RF, ALU, WB). The register file is drawn twice. NB: SAME RF AS ABOVE!]

Bypass Paths

[Figure: 4-stage miniMIPS datapath with bypass muxes that feed values from later pipeline stages back to the operand inputs.]

Structural Data Hazard

Consider LOADS: Can we fix all these problems using our previous bypass paths?

    lw  $t4,0($t1)
    add $t5,$t1,$t4
    xor $t6,$t3,$t4

[Figure: pipeline timing diagram, stages IF, RF, ALU, WB versus clock cycles i through i+6.]

For a lw instruction fetched during clock i, data isn't returned from memory until late in cycle i+3. Bypassing will fix xor but not add!
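The timing argument above can be checked with a few lines of Python. This is a minimal sketch, not the lecture's hardware: it assumes a 4-stage pipeline (IF, RF, ALU, WB), that a bypassed operand must exist by the end of the consumer's RF cycle, that ALU results exist at the end of the ALU cycle, and that load data arrives at the end of the WB cycle. The function names are hypothetical.

```python
# Toy timing model of the 4-stage pipeline (IF, RF, ALU, WB).
# Cycle numbers are relative to i, the cycle in which lw is fetched.

def result_ready_cycle(fetch_cycle: int, op: str) -> int:
    """Cycle at whose end the result exists somewhere in the pipeline."""
    # Assumption: loads deliver data late in WB (fetch + 3),
    # ALU operations deliver at the end of the ALU stage (fetch + 2).
    return fetch_cycle + (3 if op == "lw" else 2)

def operand_needed_cycle(fetch_cycle: int) -> int:
    """Cycle by whose end a bypassed operand must be available (RF stage)."""
    return fetch_cycle + 1

def bypass_suffices(producer_fetch: int, producer_op: str, consumer_fetch: int) -> bool:
    """True if a bypass path alone can deliver the value in time."""
    return result_ready_cycle(producer_fetch, producer_op) <= operand_needed_cycle(consumer_fetch)

# lw fetched at cycle i = 0, add at i+1, xor at i+2
add_ok = bypass_suffices(0, "lw", 1)   # add needs $t4 one cycle too early
xor_ok = bypass_suffices(0, "lw", 2)   # xor can pick it up from the WB bypass
```

Under these assumptions `add_ok` is false and `xor_ok` is true, which matches the slide's conclusion that bypassing fixes xor but not add.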

Load Delays

Bypassing can't fix the problem with add, since the data simply isn't available! To fix it we have to add pipeline interlock hardware to stall add's execution, or else program around it.

    lw  $t4,0($t1)
    add $t5,$t1,$t4
    xor $t6,$t3,$t4

[Figure: pipeline timing diagram, stages IF, RF, ALU, WB versus clock cycles i through i+6, with add held for an extra cycle and xor following it.]

Adding stalls to the pipeline in order to assure proper operation is sometimes called inserting pipeline BUBBLES.

This requires inserting a MUX just before the instruction register of the ALU stage, IR_ALU, to annul the add (by inserting a NOP), as well as clock enables on the PC and IR pipeline registers of earlier pipeline stages to stall the execution without annulling any instructions. This is how the simulator, SPIM, works.

Punting on Load Interlock

Early versions of MIPS did not include a pipeline interlock, thus requiring the compiler/programmer to work around it.

    lw  $t4,0($t1)
    nop
    add $t5,$t1,$t4
    xor $t6,$t3,$t4

If the compiler knows about the load delay, it can often rearrange the code sequence to eliminate the hazard. Many compilers can provide implementation-specific instruction scheduling.

This requires no additional H/W, but it further confuses the instruction semantics. We'll include interlocks for SPIM compatibility.
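The software workaround can be sketched as a tiny scheduling pass. This is a hedged illustration only: it assumes a one-instruction load delay, represents each instruction as a hypothetical `(op, dest, sources)` tuple, and pads with a nop rather than reordering code the way a real compiler scheduler would.

```python
from typing import List, Optional, Tuple

# Hypothetical instruction encoding: (opcode, destination register, source registers)
Instr = Tuple[str, Optional[str], Tuple[str, ...]]
NOP: Instr = ("nop", None, ())

def insert_load_delay_nops(program: List[Instr]) -> List[Instr]:
    """Insert a nop after every lw whose result is read by the very next instruction."""
    scheduled: List[Instr] = []
    for instr in program:
        if scheduled:
            prev_op, prev_dest, _ = scheduled[-1]
            _, _, sources = instr
            # Load-use hazard: the previous instruction is a load and
            # this instruction reads the register it is loading.
            if prev_op == "lw" and prev_dest in sources:
                scheduled.append(NOP)
        scheduled.append(instr)
    return scheduled

program = [
    ("lw",  "$t4", ("$t1",)),
    ("add", "$t5", ("$t1", "$t4")),
    ("xor", "$t6", ("$t3", "$t4")),
]
padded = insert_load_delay_nops(program)
```

For the slide's sequence, `padded` holds lw, nop, add, xor. A hardware interlock has the same effect at run time by holding add in place and injecting the NOP into IR_ALU.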

Load Delays (cont'd)

But, but, what about FASTER processors?

FACT: Processors have become very fast relative to memories!

Can we just stall the pipe longer? Add more NOPs?

ALTERNATIVE: Longer pipelines.
1. Add MEMORY WAIT stages between START of read operation & return of data.
2. Build pipelined memories, so that multiple (say, N) memory transactions can be in progress at once.
3. (Optional) Stall pipeline when the N limit is exceeded.

The 4-stage pipeline requires READ access in LESS than one clock. Sadly, this is our bottleneck. If we want to go faster, we will have to surround it with pipeline stages.

SOLUTION: A 5-stage pipeline that allows nearly two clocks for data memory accesses...

5-Stage miniMIPS

Omits some details. NO bypass or interlock logic.

[Figure: 5-stage miniMIPS datapath (IF, RF, ALU, MEM, WB). Annotations: the address is available right after the instruction enters the Memory stage; the data is needed just before the rising clock edge at the end of the Write Back stage, which leaves almost 2 clock cycles for the memory access.]

One More Fly in the Ointment

There is one more structural hazard that we have not discussed: the saving, and subsequent accesses, of the return address resulting from the jump-and-link instruction, jal. Moreover, given that we have bought into a single delay slot, which is always executed, we now need to store the address of the instruction FOLLOWING the delay-slot instruction.

    jal  sqr          # call procedure
    addi $a0,$0,10    # delay slot
    addi $t0,$v0,-1   # return address

We need to return here, to PC+8, not PC+4. Once more we need to rewrite the ISA spec!

Return Address

    # The code: Assume Reg[LP] 1...
       add  $ra,$0,$0
       jal  f
       addi $t1,$ra,-4    # In delay slot
       ...
    f: xor  $t0,$ra,$0
       or   $1,$0,$ra
       add  $t2,$0,$ra

            i     i+1   i+2   i+3   i+4   i+5   i+6
    IF      add   jal   addi  xor   or    add
    RF            add   jal   addi  xor   or    add
    ALU                 add   jal   addi  xor   or
    MEM                       add   jal   addi  xor
    WB                              add   jal   addi

(The original diagram also marks the decision time and the register-file reads and writes.)

Can we make the regfile accesses of the 3 instructions following the jal work by bypassing? Where do we get the right return address from?
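The PC+8 rule is easy to state in code. The sketch below is only an illustration under the slide's assumptions (4-byte instructions, exactly one delay slot that always executes); the function name and the example addresses are hypothetical.

```python
def jal_effects(jal_addr: int, target_addr: int) -> dict:
    """Architectural effect of a jal with a single, always-executed delay slot."""
    delay_slot_addr = jal_addr + 4   # executes before control reaches the target
    return_addr = jal_addr + 8       # first instruction after the delay slot
    return {
        "ra": return_addr,
        "execution_order": [jal_addr, delay_slot_addr, target_addr],
    }

# Hypothetical addresses: jal at 0x100 calling a procedure at 0x200
effects = jal_effects(0x100, 0x200)
```

Here `effects["ra"]` is 0x108. Linking PC+4 instead would return into the delay slot and execute it a second time.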

JAL Bypasses

[Figure: 5-stage miniMIPS datapath with addi $9,$31,-4, jal f, and add $ra,$0,$0 in flight, highlighting a bypass taken from the PC pipeline.]

On JALs, the register file saves the next address from the DELAY SLOT instruction (often PC+8). Note this bypass is routed from the PC pipeline, not from the ALU output. Thus, we need to add bypass paths for PC_REG.

JAL Bypasses (cont'd)

[Figure: the same datapath one cycle later, with xor $8,$31,$0, addi $9,$31,-4, jal f, and add $ra,$0,$0 in flight.]

We need another bypass. In this case, the bypass path supplies the $31 operand for the XOR instruction.

JAL Bypasses (cont'd)

[Figure: the same datapath one cycle later, with or $1,$0,$31, xor $8,$31,$0, addi $9,$31,-4, and jal f in flight.]

And, we need another MEM bypass. In this case, the bypass path supplies the $31 operand for the OR instruction.

WB is already taken care of, for the following ADD, using the WB stage bypass at the output of the WDSEL mux.

5-Stage miniMIPS

[Figure: 5-stage miniMIPS datapath with bypass muxes and a NOP input to the ALU-stage instruction register.]

We wanted a simple, clean pipeline but...
- broke the sequential semantics of the ISA by adding a branch delay slot and early branch resolution logic
- added A/B bypass muxes to get data before it's written to the regfile
- added CLK EN to freeze the IF/RF stages so we can wait for lw to reach the WB stage

Bypass MUX Details

The previous diagram was oversimplified. The bypass muxes really need to precede the A and B muxes in order to provide the correct values for the jump target (JT), write data, and early branch decision logic.

[Figure: A Bypass and B Bypass muxes, fed by the register file and by values from ALU/MEM/WB/PC, placed ahead of the ASEL and BSEL muxes (which also take shamt and the 16-bit immediate). Their outputs go to JT, the Z branch logic, the ALU, and memory.]

Bypass Logic

miniMIPS bypass logic (need two copies for A/B data):

[Figure: the source register (rs or rt) is compared, using 5-bit comparators, against the destination register (rt or rd) of the instructions in the ALU, MEM, and WB stages*. A match selects the ALU bypass, MEM bypass, or WB bypass; with no match the regfile output is used (no bypass).]

* If the instruction is a sw (doesn't write into the regfile), set rt for ALU/MEM/WB to $0.
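The comparator-and-select structure described above can be sketched in Python. This is a simplified model, not the slide's exact logic: the priority order (nearest stage first, so the most recent write wins) is an assumption, the function name is hypothetical, and the extra PC bypasses needed for jal are left out.

```python
def select_operand_source(src_reg: int, alu_dest: int, mem_dest: int, wb_dest: int) -> str:
    """Choose where one operand (A or B) comes from; hardware needs one copy per operand.

    Each *_dest is the register number written by the instruction currently in
    that stage, or 0 if that instruction writes nothing (e.g. sw).
    """
    if src_reg == 0:
        return "regfile"      # $0 always reads as zero, so it is never bypassed
    # Assumed priority: the youngest producer (closest stage) holds the newest value.
    if src_reg == alu_dest:
        return "ALU bypass"
    if src_reg == mem_dest:
        return "MEM bypass"
    if src_reg == wb_dest:
        return "WB bypass"
    return "regfile"          # no match: no bypass

# Example: $12 ($t4) is being produced by the instruction now in the MEM stage
source = select_operand_source(src_reg=12, alu_dest=13, mem_dest=12, wb_dest=0)
```

Forcing the destination to $0 for instructions that do not write the regfile, as the footnote suggests, means the comparators need no separate "writes a register" signal.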

Final 5-Stage miniMIPS

[Figure: final 5-stage miniMIPS datapath with bypass muxes and a NOP input to the ALU-stage instruction register.]

- Added branch delay slot and early branch resolution logic to fix a CONTROL hazard
- Added lots of bypass paths and detection logic to fix various STRUCTURAL hazards
- Added pipeline interlocks to fix the load delay STRUCTURAL hazard

Pipeline Summary (I)

Started with unpipelined implementation
- direct execute, 1 cycle/instruction
- it had a long cycle time: mem + regs + alu + mem + wb

We ended up with a 5-stage pipelined implementation
- increase throughput (3x???)
- delayed branch decision (1 cycle): choose to execute the instruction after the branch
- delayed register writeback (3 cycles): add bypass paths (6 x 2 = 12) to forward the correct value
- memory data available only in WB stage: introduce NOPs at IR_ALU, to stall the IF and RF stages until the LD result is ready
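A back-of-the-envelope cycle count shows why stalls eat into the throughput gain. The formula below is the usual idealized model (one instruction issued per cycle, plus fill time, plus stall cycles); it is a sketch with a hypothetical function name, not a measurement of miniMIPS.

```python
def pipelined_cycles(n_instructions: int, n_stages: int, stall_cycles: int = 0) -> int:
    """Total cycles for a straight-line sequence on an ideal in-order pipeline."""
    if n_instructions == 0:
        return 0
    # The first instruction takes n_stages cycles to drain; each later
    # instruction finishes one cycle after its predecessor, plus any stalls.
    return n_stages + (n_instructions - 1) + stall_cycles

# The lw/add/xor example on the 4-stage pipeline
without_stall = pipelined_cycles(3, 4)                  # ideal case
with_interlock = pipelined_cycles(3, 4, stall_cycles=1) # one load-use bubble
```

With the single bubble the three instructions occupy cycles i through i+6, which is the span drawn in the 4-stage load-delay timing diagrams.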

Pipeline Summary (II)

Fallacy #1: Pipelining is easy
- Smart people get it wrong all of the time!
- Costs? Re-spins of the design. Force S/W folks to devise program/compiler workarounds.

Fallacy #2: Pipelining is independent of ISA
- Many ISA decisions impact how easy/costly it is to implement pipelining (e.g., branch semantics, addressing modes).
- Bad decisions impact future implementations (delay slot vs. annul? load interlocks?) and break otherwise clean semantics. For performance, S/W must be aware!

Fallacy #3: Increasing pipeline stages improves performance
- Diminishing returns. Increasing complexity. Can introduce unusable delay slots, long interlock stalls.
- Power4: 7; Power5: 23
- Pentium: 5; Pentium/MMX: 6; Pentium III: 10; Pentium 4: 20, 31

RISC = Simplicity???

The P.T. Barnum World's Tallest Dwarf Competition: World's Most Complex RISC?

[Figure: a progression of machine designs, labeled: Primitive Machines with direct implementations; Generalization of registers and operand coding; Complex instructions, addressing modes; Addressing features, e.g. index registers; RISCs; Pipelines, Bypasses, Annulment, ...; VLIWs, Super-Scalars?]
