Pipeline design. Mehran Rezaei

How Can We Improve the Performance?
Exec Time = IC * CPI * CCT

Optimization   IC   CPI  CCT
Source Level   *
Compiler       *    *
ISA            *    *
Organization        *    *
Technology               *

With pipelining we want a clock rate that is 5 times faster.
Single-cycle machine: CPI is one.
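As a small worked example of Exec Time = IC * CPI * CCT, the sketch below compares a single-cycle machine with an ideal 5-stage pipeline. The instruction count and the clock cycle times (10 ns and 2 ns) are assumed numbers chosen only for illustration, and the pipeline is assumed to keep CPI at 1 (no hazards).

```python
def exec_time_ns(ic, cpi, cct_ns):
    """Execution time in nanoseconds: instruction count * CPI * cycle time."""
    return ic * cpi * cct_ns

# Assumed workload and clock periods (illustrative only).
IC = 1_000_000
single_cycle = exec_time_ns(IC, cpi=1, cct_ns=10)  # one long cycle per instruction
pipelined = exec_time_ns(IC, cpi=1, cct_ns=2)      # cycle time cut to 1/5, CPI still 1

speedup = single_cycle / pipelined
print(f"speedup = {speedup}")
```

With the cycle time divided by five and CPI unchanged, the speedup equals the ratio of the two cycle times.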

Analogy: order, pay, pickup

Pipelining
(Figure: the instructions lw, or, add, sw, and, sub overlapping in the microprocessor pipeline; stages: fetch, decode, ALU, mem, writeback.)

Pipeline design
Break the execution of the instruction into cycles.
Design a separate datapath stage for the execution performed during each cycle.
Build pipeline registers to communicate between the stages.

(Figure: single-cycle datapath with its control unit and ALU control; control signals ExtOp, RegDst, ALUSrc, ALUOp, ALUCtr, Branch, MemtoReg, OVF.)

(Figure: the same datapath divided into five stages: IF, ID, EXE, MEM, WB.)

Instruction Fetch
Design a datapath that can fetch an instruction from memory every cycle.
Use the PC to index memory and read the instruction.
Increment the PC (assume no branches for now).
Write everything needed to complete execution to the pipeline register (IF/ID).
The next stage will read this pipeline register.
Note that the pipeline register must be edge triggered.
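The fetch step and the edge-triggered behaviour of the IF/ID register can be sketched as follows. This is a minimal model, not the course's implementation: the class name `PipelineRegister`, the field names `pc_plus_4` and `inst`, and the tiny instruction memory are all assumptions made for the example.

```python
class PipelineRegister:
    """Edge-triggered register: a write becomes visible only after tick()."""

    def __init__(self):
        self.current = {}   # what the next stage reads during this cycle
        self._next = {}     # what the previous stage wrote during this cycle

    def write(self, **fields):
        self._next = dict(fields)

    def tick(self):
        """Clock edge: latch the value written during the cycle."""
        self.current = self._next


def fetch(pc, imem, if_id):
    """Read the instruction at pc, write IF/ID, and return the incremented PC."""
    if_id.write(pc_plus_4=pc + 4, inst=imem[pc // 4])
    return pc + 4  # no branches for now


imem = ["add", "lw", "sub"]          # hypothetical instruction memory
if_id = PipelineRegister()
pc = fetch(0, imem, if_id)
before_edge = dict(if_id.current)    # decode still sees the old contents
if_id.tick()
after_edge = dict(if_id.current)     # new contents appear after the edge
```

Because the register only updates on `tick()`, the decode stage can read the old value during the same cycle in which fetch writes the new one.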

(Figures: the datapath of each stage and the pipeline registers it reads and writes.)
IF: uses pc and the instruction memory; writes PC+4 and Inst. to the IF/ID registers.
ID: reads PC+4 and Inst. from IF/ID; writes PC+4, RegA, RegB, IMM, Rt and Rd to the ID/EXE registers.
EXE: reads ID/EXE; writes Br. Tr. Add. (branch target address), ALUres, RegB and Rt/Rd to the EXE/MEM registers.
MEM: reads EXE/MEM; writes Mem (memory data), ALUres and Rt/Rd to the MEM/WB registers.
WB: reads Mem, ALUres and Rt/Rd from the MEM/WB registers.

Example
Run the following code on our pipeline machine:
add $,$0,$3
lw $,20($2)
sub $5,$6,$6
sw $7,8($8)
add $9,$,$3

(Figures: five successive datapath snapshots, one per cycle, showing the register file contents as each instruction of the example enters the pipeline and the earlier ones move one stage forward.)

Recall: Single cycle control!
(Figure: single-cycle datapath with Next PC logic, ideal instruction memory, 32 32-bit registers (Rw, Ra, Rb selected by the 5-bit Rd, Rs, Rt fields), ALU and ideal data memory; the control unit turns the instruction and the ALU conditions into control signals.)

Stationary Control
The Main Control generates the control signals during Reg/Dec.
Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later.
Control signals for Mem (MemWr, Branch) are used 2 cycles later.
Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later.
(Figure: the control signals travel with the instruction through the pipeline registers. ID/Ex holds ExtOp, ALUSrc, ALUOp, RegDst, MemWr, Branch, MemtoReg, RegWr; Ex/Mem holds MemWr, Branch, MemtoReg, RegWr; Mem/Wr holds MemtoReg, RegWr.)
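The idea that control signals are generated once and then carried along with the instruction can be sketched like this. The signal names come from the slide; the concrete values returned by `main_control_lw` (a load) and the helper `advance` are assumptions made for illustration.

```python
# Signals consumed in each stage, following the slide's grouping.
EX_SIGNALS = ("ExtOp", "ALUSrc", "ALUOp", "RegDst")
MEM_SIGNALS = ("MemWr", "Branch")
WB_SIGNALS = ("MemtoReg", "RegWr")


def main_control_lw():
    """Control bundle produced during Reg/Dec for a load (assumed values)."""
    return {"ExtOp": 1, "ALUSrc": 1, "ALUOp": "add", "RegDst": 0,
            "MemWr": 0, "Branch": 0, "MemtoReg": 1, "RegWr": 1}


def advance(bundle, consumed):
    """Copy the bundle to the next pipeline register, dropping used signals."""
    return {name: value for name, value in bundle.items() if name not in consumed}


id_ex = main_control_lw()               # written at the end of Reg/Dec
ex_mem = advance(id_ex, EX_SIGNALS)     # Exec used its signals 1 cycle later
mem_wr = advance(ex_mem, MEM_SIGNALS)   # Mem used its signals 2 cycles later
```

Each pipeline register gets narrower as the instruction moves on, which matches the shrinking signal lists in the figure.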

Datapath + Stationary Control
(Figure: the IR fields op, rs, rt, fun are decoded once; the control fields ex, me, wb, together with v and rw, travel through the pipeline registers alongside the data and are consumed by Exec, Mem Ctrl and WB Ctrl.)

Pipeline timing diagram
Cycle           1    2    3    4    5    6    7    8    9
add $,$0,$3     IF   ID   EXE  MEM  WB
lw $,20($2)          IF   ID   EXE  MEM  WB
sub $5,$6,$6              IF   ID   EXE  MEM  WB
sw $7,8($8)                    IF   ID   EXE  MEM  WB
add $9,$,$3                         IF   ID   EXE  MEM  WB
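A timing diagram like this one can be generated mechanically when there are no hazards: instruction i (counting from 0) is in stage s during cycle i + s + 1. The sketch below assumes exactly that ideal case; the function names are made up for the example.

```python
STAGES = ["IF", "ID", "EXE", "MEM", "WB"]


def schedule(n_instructions):
    """For each instruction, return {cycle: stage}; cycles start at 1."""
    return [
        {i + s + 1: stage for s, stage in enumerate(STAGES)}
        for i in range(n_instructions)
    ]


def total_cycles(n_instructions):
    """Cycles needed to drain the pipeline when nothing stalls."""
    return n_instructions + len(STAGES) - 1


rows = schedule(5)
for i, row in enumerate(rows):
    cells = [row.get(c, "") for c in range(1, total_cycles(5) + 1)]
    print(f"inst {i}: " + " ".join(f"{cell:4}" for cell in cells))
```

Five instructions finish in nine cycles, compared with 25 cycles if each instruction had to complete all five steps before the next one started.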

Hazards
What are they?
How do you detect them?
How do you deal with them?

(Figure: pipelined datapath with the pipeline register fields named. IF/ID: PC+4, instruction. ID/EXE: PC+4, vala, valb, IMM, dest. EXE/MEM: target, ALUres, eq?, valb, dest. MEM/WB: mdata, ALUres, dest.)

Pipeline cycles for add
IF - Fetch: read instruction from memory
ID - Decode: read source operands from reg
EXE - Execute: calculate sum
MEM - Memory: pass results to next stage
WB - Write back: write sum (ALUres) into register file

Hazard
add $1,$2,$3    IF   ID   EXE  MEM  WB        (register one is written)
sub $,$5,$1          IF   ID   EXE  MEM  WB   (register one is read)
If we are not careful, we will read the wrong value!
If sub is supposed to read the updated value (not the stale one), how many instructions should there be between add and sub?

(Figure: datapath snapshot with add one stage ahead of sub.)

Hazard
add $1,$2,$3    IF   ID      EXE     MEM  WB (write)
sub $,$5,$1          IF      hazard  hazard  ID (read)  EXE  MEM  WB

Class work
What are the data hazards in this piece of code?
add $,$2,$3
sub $2,$,$3
xor $,$3,$5
nor $5,$2,$
add $5,$3,$5

What to do with them?
Avoid: make sure there are no hazards in the code.
Detect and Stall: if hazards exist, stall the processor until they go away.
Detect and Forward: if hazards exist, fix up the pipeline to get the correct value (if possible).

First Approach: avoid all hazards
Assume the programmer (or the compiler) knows about the processor implementation.
Make sure no hazards exist.
Consider an instruction called noop.
Put noops between any dependent instructions.

add $1,$2,$3    IF   ID   EXE  MEM  WB
noop
noop
sub $,$5,$1                    IF   ID   EXE  MEM  WB
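What a compiler would have to do under this approach can be sketched as a small pass over the program. The tuple encoding `(op, dest, sources)` and the function name are assumptions for the example. `min_distance=3` assumes the register file is written in the first half of a cycle and read in the second half, which is what the two noops in the slide's example imply.

```python
NOOP = ("noop", None, ())


def insert_noops(program, min_distance=3):
    """Pad the program so a consumer is at least min_distance slots after its producer."""
    out = []
    for op, dest, srcs in program:
        needed = 0
        # Look back over the slots that are still too close to this instruction.
        for back in range(1, min_distance):
            if back > len(out):
                break
            prev_dest = out[-back][1]
            if prev_dest is not None and prev_dest in srcs:
                needed = max(needed, min_distance - back)
        out.extend([NOOP] * needed)
        out.append((op, dest, srcs))
    return out


# add writes register 1, sub reads it; sub's destination (4) is hypothetical.
padded = insert_noops([("add", 1, (2, 3)), ("sub", 4, (5, 1))])
for inst in padded:
    print(inst[0])
```

The program grows from two to four instructions, which is exactly the code-size cost this approach pays.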

What is the problem with this solution?
Old programs (legacy code) may not run correctly on new implementations.
Longer pipelines need more noops.
Programs get larger as noops are included.
This is especially a problem for machines that try to execute more than one instruction every cycle.
Intel EPIC: often 25% - 40% of instructions are noops.
Program execution is slower: CPI is 1, but some instructions are noops.

The second solution
Detect:
  Compare rega with previous DestRegs (5 bit operand fields)
  Compare regb with previous DestRegs (5 bit operand fields)
Stall:
  Keep current instructions in fetch and decode
  Pass a noop to execute
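The detect and stall rules above can be sketched in a few lines. The encoding of an instruction as `(name, rega, regb)` and the list `prev_dests` of destination registers still ahead in the pipeline are assumptions for this example; `None` stands for an instruction that writes no register.

```python
def detect_hazard(rega, regb, prev_dests):
    """True when rega or regb matches a destination register still in flight."""
    return any(d is not None and d in (rega, regb) for d in prev_dests)


def decode_step(inst, prev_dests):
    """Return (instruction passed to execute, whether fetch and decode are held)."""
    _, rega, regb = inst
    if detect_hazard(rega, regb, prev_dests):
        return ("noop", None, None), True   # stall: hold IF and ID, bubble into EXE
    return inst, False


# sub reads registers 5 and 1 while an add that writes register 1 is ahead of it.
to_execute, stalled = decode_step(("sub", 5, 1), prev_dests=[1])
print(to_execute, stalled)
```

In hardware each comparison is a 5-bit equality check, one per source operand and per pipeline register that may hold a pending destination.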

Hazard
Addr 0x00
add $1,$2,$3    IF   ID      EXE     MEM  WB (write)
sub $,$5,$1          IF      hazard  hazard  ID (read)  EXE  MEM  WB

(Figures: datapath snapshots for the first and second halves of cycles 1 to 3. add is fetched from 0x00 in cycle 1, sub is fetched in cycle 2, and the hazard is detected in the first half of cycle 3, when sub is in decode and add is in execute.)

Hazard detected
(Figure: comparators check rega and regb of the instruction in the IF/ID register against the destination registers held in the ID/EX register and the later pipeline registers.)

What Next?
Detect:
  Compare rega with previous DestRegs (5 bit operand fields)
  Compare regb with previous DestRegs (5 bit operand fields)
Stall:
  Keep current instructions in fetch and decode
  Pass a noop to execute

(Figures: datapath snapshots from the second half of cycle 3 to the second half of cycle 5. sub is held in decode and the PC is held at 0x0c while noops are passed to execute; once add has written its result back, sub reads the updated value and moves on.)

Timing graph
Time: 1 2 3 4 5 6 7 8 9 10 11 12 13
add $,$2,$3:  IF ID EX ME WB
sub $,$,$5:   IF noop noop ID EX ME WB
add $6,$,$7:  IF ID EX ME WB
lw $6,0($8):  IF ID EX ME WB
sw $6,3($):   IF noop noop ID EX ME

Problems with the second solution
The effective CPI is still the same as before, so there is no improvement in performance.
The only improvement is in code size, and the compiler is no longer responsible for detecting the data hazards.
In fact, the system now runs slower. Why?
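The cost of stalling can be estimated with a small timing model. It is a sketch under stated assumptions: instructions issue in order, an instruction reads its registers in ID, a result becomes readable in the same cycle its producer reaches WB (write in the first half, read in the second half), and WB comes three cycles after ID. The program encoding `(dest, sources)` and the register numbers are hypothetical.

```python
def wb_cycles(program):
    """Return the write-back cycle of each instruction under detect and stall."""
    last_write = {}   # register -> WB cycle of its most recent writer
    prev_id = 1       # so the first instruction decodes in cycle 2
    result = []
    for dest, srcs in program:
        id_cycle = prev_id + 1                     # in-order: one decode per cycle
        for reg in srcs:
            id_cycle = max(id_cycle, last_write.get(reg, 0))  # wait for producers
        wb = id_cycle + 3                          # ID -> EX -> ME -> WB
        if dest is not None:
            last_write[dest] = wb
        result.append(wb)
        prev_id = id_cycle
    return result


# Hypothetical program: the 2nd instruction needs the 1st, the 5th needs the 4th.
program = [(1, (2, 3)), (4, (1, 5)), (6, (8, 7)), (9, (8,)), (None, (9, 10))]
cycles = wb_cycles(program)
print(cycles, "total cycles:", cycles[-1])
```

Under these assumptions five instructions need 13 cycles instead of the 9 an ideal pipeline would need, because each dependent pair costs two bubble cycles.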

The third solution
Detect the data hazard.
The add instruction calculates its result in the execution cycle.
Forward the result to the decode stage of the sub instruction.
Therefore sub does not need to wait until the result is written back into the register file.
More control is needed: the result has to be taken from somewhere other than the register file.

The third solution
Detect: same as detect and stall, except that all hazards are treated differently (i.e., you can't logical-or the hazard signals).
Forward:
  New bypass datapaths route computed data to where it is needed.
  New MUX and control to pick the right data.
Beware: stalling may still be required even in the presence of forwarding.
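The new MUX that picks the right data can be sketched as a priority selection: the youngest matching result wins, and the register file is used only when nothing in flight matches. The function name and the `(dest, value)` tuples for the two instructions ahead are assumptions for this example; register 0 is never forwarded because in MIPS it always reads as zero.

```python
def select_operand(reg, regfile_value, newer, older):
    """Pick the value for source register `reg`.

    newer / older: (dest, value) of the two instructions ahead in the
    pipeline, youngest first, or None when that slot holds no result.
    """
    for stage in (newer, older):
        if stage is not None and reg != 0 and stage[0] == reg:
            return stage[1]          # bypass path: take the in-flight result
    return regfile_value             # no hazard: the register file is up to date


# Register 1 holds a stale 99; two in-flight instructions both write register 1.
value = select_operand(1, 99, newer=(1, 7), older=(1, 3))
print(value)
```

This is also why the hazard signals cannot simply be OR-ed together: the MUX control has to know which stage matched, not just that some stage did.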

(Figures: datapath snapshots from the first half of cycle 3 to the first half of cycle 5 with forwarding paths (FW) and hazard flags (H1, H2). The hazard between add and sub detected in cycle 3 is resolved by forwarding instead of stalling; new hazards are flagged in cycles 4 and 5 as add $6,$,$7, lw $6,0($8) and sw $6,3($) enter the pipeline.)

What else can go wrong in our pipelined CPU?
Control hazards
Exceptions: first of all, what are exceptions? And how do you handle exceptions in a pipelined processor with 5 instructions in flight?

Control Hazard
What is a control hazard?
How does the pipelined CPU handle control hazards?

(Figure: pipelined datapath with beq/bne support. The branch target and the eq? result are computed in execute and carried in the EXE/MEM register; the Control Unit uses them to select the next PC.)

What happens in executing BEQ?
Fetch: read instruction from memory
Decode: read source operands from reg
Execute: calculate target address and test for equality
Memory: send target to PC if test is equal
Write back: nothing left to do
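The target address calculation in the execute stage can be written out as a one-line formula. The sketch assumes the usual MIPS convention, which matches the Extension and Shift Left 2 blocks in the datapath: the immediate is a signed word offset relative to PC+4. The function names are made up for the example.

```python
def branch_target(pc, offset):
    """Target = PC + 4 + (sign-extended offset shifted left by 2)."""
    return pc + 4 + (offset << 2)


def next_pc(pc, offset, equal):
    """beq semantics: go to the target when the operands are equal, else fall through."""
    return branch_target(pc, offset) if equal else pc + 4


# A branch at address 24 with offset -4 goes back to address 12.
print(branch_target(24, -4))
```

For a bne the same target is used, but the branch is taken when the equality test fails.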

Example
y=y*2;
x=0;
for(j=00;j>0;j--){
  x++;
  z--;
}
y--;
x=x*3;
z=z+x;

00 add $3,$3,$3
04 add $2,$0,$0
08 li $5,00
12 addi $2,$2,1
16 addi $,$,-1
20 addi $5,$5,-1
24 bne $5,$0,12
28 addi $3,$3,-1
32 add $5,$2,$0
36 add $2,$2,$2
40 add $2,$2,$5
44 add $,$,$2

What do you observe from the example?
How many times is the branch taken? How many times is it not taken?
What happens each time the branch instruction is executed?
What happens next?

Surprise!
12 addi $2,$2,1
...
24 bne $5,$0,-4
28 addi $3,$3,-1
32 add $5,$2,$0
36 add $2,$2,$2

24    IF   ID   EXE  MEM  WB
28         IF   ID   EXE  MEM  WB
32              IF   ID   EXE  MEM  WB
36                   IF   ID   EXE  MEM  WB
12                        IF   ID   EXE  MEM  WB

Solutions
Avoid: make sure there are no hazards in the code.
Detect and Stall: delay fetch until the branch is resolved.
Speculate and Squash-if-Wrong: go ahead and fetch more instructions in case the guess is correct, but stop them if they shouldn't have been executed.

Avoid
Don't have branch instructions! Maybe a little impractical.
Delay taking the branch:
  dbeq R1,R2,offset
  dbne R1,R2,offset
Instructions at PC+4, PC+8, etc. will execute before deciding whether to fetch from PC+4+offset.
(If no useful instructions can be placed after dbeq, noops must be inserted.)

Consider our example again

Original code:
00 add $3,$3,$3
04 add $2,$0,$0
08 li $5,00
12 addi $2,$2,1
16 addi $,$,-1
20 addi $5,$5,-1
24 bne $5,$0,-4
28 addi $3,$3,-1
32 add $5,$2,$0
36 add $2,$2,$2
40 add $2,$2,$5
44 add $,$,$2

With noops after the branch:
00 add $3,$3,$3
04 add $2,$0,$0
08 li $5,00
12 addi $2,$2,1
16 addi $,$,-1
20 addi $5,$5,-1
24 bne $5,$0,-4
28 noop
32 noop
36 noop
40 addi $3,$3,-1
44 add $5,$2,$0
48 add $2,$2,$2
52 add $2,$2,$5
56 add $,$,$2

Can we do better?

Useful instructions moved after the delayed branch:
00 add $3,$3,$3
04 add $2,$0,$0
08 li $5,00
12 addi $5,$5,-1
16 dbne $5,$0,-2
20 addi $,$,-1
24 addi $2,$2,1
28 noop
32 addi $3,$3,-1
36 add $5,$2,$0
40 add $2,$2,$2
44 add $2,$2,$5
48 add $,$,$2

Moving the branch one more slot up:
00 add $3,$3,$3
04 add $2,$0,$0
08 li $5,00
12 dbne $5,$0,-1
16 addi $5,$5,-1
20 addi $,$,-1
24 addi $2,$2,1
28 addi $3,$3,-1
32 add $5,$2,$0
36 add $2,$2,$2
40 add $2,$2,$5
44 add $,$,$2
This code generates wrong results.

Problems with this solution
Old programs (legacy code) may not run correctly on new implementations.
Longer pipelines need more instructions/noops after a delayed beq.
Programs get larger as noops are included.
This is especially a problem for machines that try to execute more than one instruction every cycle.
Intel EPIC: often 25% - 40% of instructions are noops.
Program execution is slower: CPI equals 1, but some instructions are noops.

Detect and Stall (hardware approach)
Detection:
  Must wait until decode
  Compare opcode to beq
  Alternately, this is just another control signal
Stall:
  Keep current instructions in fetch
  Pass noop to decode stage (not execute!)

Our example again
00 add $3,$3,$3
04 add $2,$0,$0
08 li $5,00
12 addi $2,$2,1
16 addi $,$,-1
20 addi $5,$5,-1
24 bne $5,$0,-4
28 addi $3,$3,-1
32 add $5,$2,$0
36 add $2,$2,$2
40 add $2,$2,$5
44 add $,$,$2

(Figures: datapath snapshots of detect and stall. bne at 24 is recognized in decode while the PC holds 28; the PC is held and noops follow the branch down the pipeline until it resolves; then the target 12 is loaded into the PC and addi $2,$2,1 is fetched behind three noops.)

What seems to be the problem?
CPI increases every time a branch is detected!
Is that necessary? Not always!
Only about ½ of the time is the branch taken.
Let's assume that it is NOT taken.
In this case, we can ignore the beq or bne (treat them like a noop) and keep fetching from PC+4.
What if we are wrong?
OK, as long as we do not COMPLETE any instructions we mistakenly executed (i.e. don't perform writeback).

Speculate and Squash
Speculate: assume not equal
  Keep fetching from PC+4 until we know that the branch is really taken
Squash: stop bad instructions if taken
  Send a noop to: Decode, Execute and Memory
  Send target address to PC
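The difference between stalling on every branch and squashing only on taken branches can be put into numbers. The penalty of 3 cycles follows from the three noops sent to Decode, Execute and Memory; the branch frequency of 20% is an assumed figure for illustration, and the taken fraction of one half is the slide's rough estimate.

```python
def cpi_stall(branch_frac, penalty=3):
    """Every branch costs `penalty` bubble cycles."""
    return 1 + branch_frac * penalty


def cpi_squash(branch_frac, taken_frac, penalty=3):
    """Only taken branches (wrong guesses under predict-not-taken) cost bubbles."""
    return 1 + branch_frac * taken_frac * penalty


BRANCH_FRAC = 0.2   # assumed share of branches in the instruction mix
TAKEN_FRAC = 0.5    # "about 1/2 of the time" the branch is taken

print("detect and stall:    ", cpi_stall(BRANCH_FRAC))
print("speculate and squash:", cpi_squash(BRANCH_FRAC, TAKEN_FRAC))
```

With these numbers the CPI drops from about 1.6 to about 1.3: speculation halves the branch cost but does not remove it.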

Our example again
00 add $3,$3,$3
04 add $2,$0,$0
08 li $5,00
12 addi $2,$2,1
16 addi $,$,-1
20 addi $5,$5,-1
24 bne $5,$0,-4
28 addi $3,$3,-1
32 add $5,$2,$0
36 add $2,$2,$2
40 add $2,$2,$5
44 add $,$,$2

(Figure: squash. When bne at 24 turns out to be taken, the instructions fetched after it from 28, 32 and 36 are replaced by noops in decode, execute and memory, and the target is sent to the PC.)

Performance problem, again
CPI increases every time a branch is taken! That is about ½ of the time.
Is that necessary? No! But how can you fetch from the target before you even know that the previous instruction is a branch, much less whether it is taken?

(Figures: a table holding a branch PC (bpc) and its target is added to the fetch stage. The entry for bne at 24 stores the target 12; the PC of each fetched instruction is compared with bpc, and on a match the stored target is used as the next PC instead of PC+4.)

Branch Prediction
Predict not taken: ~50% accurate
Predict backward taken: ~65% accurate
Predict same as last time: ~80% accurate
Pentium: ~85% accurate
Pentium Pro: ~92% accurate
Best paper designs: ~96% accurate
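The first three schemes are simple enough to try on a single loop branch. The sketch below counts correct predictions for a backward branch that is taken on every iteration except the last; the iteration count of 100 is an assumption for illustration, so the resulting numbers describe this one loop and not the average accuracies quoted above.

```python
def count_correct_static(outcomes, prediction):
    """Static scheme: always predict the same direction."""
    return sum(1 for taken in outcomes if taken == prediction)


def count_correct_last_time(outcomes, initial=False):
    """Predict the same direction the branch went the previous time."""
    correct, last = 0, initial
    for taken in outcomes:
        correct += (last == taken)
        last = taken
    return correct


# Assumed loop: 100 executions of the loop branch, taken 99 times, then not taken.
outcomes = [True] * 99 + [False]

not_taken = count_correct_static(outcomes, False)      # predict not taken
backward_taken = count_correct_static(outcomes, True)  # loop branch goes backward
last_time = count_correct_last_time(outcomes)

print(not_taken, backward_taken, last_time)
```

For a loop like this, "same as last time" mispredicts twice, once on entry and once on exit, while "not taken" is wrong on almost every iteration.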