Processor Design Pipelined Processor. Hung-Wei Tseng


Pipelining

Pipelining
- Break up the logic with pipeline registers into pipeline stages.
- Each stage can act on a different instruction/data.
- States/control signals of instructions are held in the pipeline registers (latches).

Pipelining
(Figure: the pipeline filling up over cycle #1 through cycle #5.)
After the 5th cycle, the processor can work on 5 instructions in parallel.

Pipelining
(Figure: the full pipeline over cycle #6 through cycle #10.)
The processor can complete 1 instruction each cycle: CPI == 1 if everything works perfectly!

Single-cycle vs. pipeline

Cycle time of a pipeline processor
- The critical path is the longest possible delay between two registers in a design.
- The critical path sets the cycle time, since the cycle time must be long enough for a signal to traverse the critical path.
- Lengthening or shortening non-critical paths does not change performance.
- Ideally, all paths are about the same length.
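The relationship between stage delays and cycle time can be sketched in a few lines of Python. The stage delays below are hypothetical numbers chosen for illustration (they are not from the slides), and the sketch ignores the delay added by the pipeline registers themselves.

```python
def pipelined_cycle_time(stage_delays_ns):
    """The slowest stage is the critical path, so it sets the cycle time."""
    return max(stage_delays_ns)


def single_cycle_time(stage_delays_ns):
    """Without pipeline registers a signal must traverse every stage in one cycle."""
    return sum(stage_delays_ns)


# Hypothetical delays in ns for five stages (assumption, for illustration only).
unbalanced = [2.0, 1.0, 2.0, 3.0, 1.0]
balanced = [1.8, 1.8, 1.8, 1.8, 1.8]

print(single_cycle_time(unbalanced), pipelined_cycle_time(unbalanced))
print(single_cycle_time(balanced), pipelined_cycle_time(balanced))
```

Shortening the 1.0 ns stages changes nothing, because the 3.0 ns stage still sets the clock. Balancing the stages is what shortens the cycle.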

Designing a 5-stage pipeline processor for MIPS

Basic steps of execution
- Instruction fetch: where? The instruction memory.
- Decode: What's the instruction? Where are the operands? The registers.
- Execute: ALUs.
- Memory access: Where is my data? The data memory.
- Write back: Where to put the result? The registers.
- Determine the next PC.
(Figure: a processor containing the PC, the ALU, and the registers, connected to an instruction memory holding a disassembled program and to a data memory.)

Pipeline a MIPS processor
- Instruction Fetch (IF): read the instruction from instruction memory.
- Instruction Decode (ID): figure out the incoming instruction and fetch the operands from the registers.
- Execution (EX): perform ALU functions.
- Memory Access (MEM): read/write data memory.
- Write Back (WB): write the results to the register file.

From single-cycle to pipeline
(Figure: the single-cycle MIPS datapath divided into the Instruction Fetch, Instruction Decode, Execution, Memory Access, and Write Back stages by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB. PCSrc = Branch & Zero.)
Will this work?

Pipelined processor
(Figure: the pipelined datapath with an add, lw, sub, sub, sw instruction sequence flowing through the stages.)

Pipelined processor
(Figure: the same datapath and instruction sequence, with the EX and ME control signal groups shown at the pipeline registers.)
Where can I find these?

Pipelined processor
(Figure: the same datapath and instruction sequence, highlighting the RegWrite signal at the register file.)
Is this right?

Pipelined processor
(Figure: the revised datapath, in which RegDst, RegWrite, and MemtoReg travel through the pipeline registers with their instruction.)

5-stage pipelined processor
(Figure: the complete 5-stage pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB.)

Simplified pipeline diagram
- Use symbols to represent the physical resources, with the abbreviations for the pipeline stages: IF, ID, EX, MEM, WB.
- The horizontal axis represents the timeline; the vertical axis represents the instruction stream.
Example:
  add $, $2, $3
  lw $4, ($5)
  sub $6, $7, $8
  sub $9,$,$
  sw $, ($2)
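A simplified pipeline diagram of this kind can be generated mechanically. The sketch below assumes an ideal pipeline with no hazards and uses only the instruction mnemonics as row labels; the function names are illustrative and not part of the slides.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(labels):
    """Return (label, cells) rows; instruction i enters IF in cycle i + 1."""
    n_cycles = len(labels) + len(STAGES) - 1
    rows = []
    for i, label in enumerate(labels):
        cells = [""] * n_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage
        rows.append((label, cells))
    return rows


def render(rows):
    """Format rows as text: time runs left to right, instructions top to bottom."""
    lines = []
    for label, cells in rows:
        line = f"{label:<5}" + "".join(f"{cell:<5}" for cell in cells)
        lines.append(line.rstrip())
    return "\n".join(lines)


rows = pipeline_diagram(["add", "lw", "sub", "sub", "sw"])
print(render(rows))
```

Reading one column of the output shows which instruction occupies which stage in that cycle. The fifth column is the first in which all five stages are busy.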

Pipeline hazards

Pipeline hazards
Even though we perfectly divide the pipeline stages, it's still hard to achieve CPI == 1.
Pipeline hazards:
- Structural hazard: the hardware does not allow two pipeline stages to work concurrently.
- Data hazard: a later instruction in a pipeline stage depends on the outcome of an earlier instruction in the pipeline.
- Control hazard: the processor is not clear about what the next instruction to fetch is.

Can we get the right result?
Given the current 5-stage pipeline, how many of the following MIPS code sequences can work correctly? The instructions in each sequence are labeled a through e.
I:
  a: add $, $2, $3
  b: lw $4, ($)
  c: sub $6, $7, $8
  d: sub $9,$,$
  e: sw $, ($2)
  Data hazard: b cannot get the value produced by a before a writes it back.
II:
  a: add $, $2, $3
  b: lw $4, ($5)
  c: sub $6, $7, $8
  d: sub $9, $, $
  e: sw $, ($2)
  Structural hazard: both a and d are accessing the same register at the 5th cycle.
III:
  a: add $, $2, $3
  b: lw $4, ($5)
  c: bne $, $7, L
  d: sub $9,$,$
  e: sw $, ($2)
  Control hazard: we don't know if d & e will be executed or not.
IV:
  a: add $, $2, $3
  b: lw $4, ($5)
  c: sub $6, $7, $8
  d: sub $9,$,$
  e: sw $, ($2)

Structural hazard

Structural hazard
- The hardware cannot support the combination of instructions that we want to execute in the same cycle.
- The original pipeline incurs a structural hazard when two instructions compete for the same register.
- Solution: write early, read late. Writes occur at the clock edge and complete long enough before the end of the clock cycle. This leaves enough time for the outputs to settle for reads.
- The revised register file is the default one from now on!
Example:
  add $, $2, $3
  lw $4, ($5)
  sub $6, $7, $8
  sub $9,$, $
  sw $, ($2)
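The write-early, read-late behavior can be modeled as a register file that applies the write before it serves the reads of the same cycle. This is a behavioral sketch only; the class and method names are made up for illustration, and the hardwired zero in register 0 follows the MIPS convention.

```python
class RegisterFile:
    """Behavioral model: within one cycle the write happens first, reads second."""

    def __init__(self, n_regs=32):
        self.regs = [0] * n_regs

    def cycle(self, write=None, reads=()):
        # First part of the cycle: the instruction in write back updates its register.
        if write is not None:
            reg, value = write
            if reg != 0:  # register 0 is hardwired to zero in MIPS
                self.regs[reg] = value
        # Later in the cycle: the instruction in decode reads its operands
        # and already sees the value written above.
        return [self.regs[r] for r in reads]


rf = RegisterFile()
values = rf.cycle(write=(1, 42), reads=(1, 2))
print(values)
```

With this ordering, an instruction that writes a register and another that reads it can share the register file in the same cycle.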

Structural hazard
- The design of the hardware causes structural hazards.
- We need to modify the hardware design to avoid structural hazards.

Data hazard

Data hazard
- Occurs when an instruction in the pipeline needs a value that is not available yet.
- Data dependences: the output of an instruction is the input of a later instruction.
- A dependence may result in a data hazard if the producing instruction is still in the pipeline when the later instruction needs the result.

Sol. of data hazard I: Stall
- When the source operand of an instruction is not ready, stall the pipeline.
  - Suspend the instruction and the following instructions.
  - Allow the previous instructions to proceed.
  - This introduces a pipeline bubble: a bubble does nothing and propagates through the pipeline like a nop instruction.
- How to stall the pipeline?
  - Disable the PC update.
  - Disable the pipeline registers of the earlier pipeline stages.
  - When the stall is over, re-enable the pipeline registers and the PC updates.

Hazard detection & stall
(Figure: the pipelined datapath with a hazard detection unit that drives PCWrite and IF/IDWrite.)
- Check if the destination register of the instruction in EX == a source register of the instruction in ID.
- Check if the destination register of the instruction in MEM == a source register of the instruction in ID.
- Insert a noop if we need to stall.

Performance of stall
- Insert a noop in the EX stage.
- Insert another noop in the EX stage; the previous noop goes to the MEM stage.
Example:
  add $, $2, $3
  lw $4, ($)
  sub $5, $2, $4
  sub $, $3, $
  sw $, ($5)
15 cycles! CPI == 3 (If there is no stall, CPI should be just 1!)

Sol. of data hazard II: Forwarding
- The result is available after the EX and MEM stages, but is only publicized in WB!
- The data is already there, so we should use it right away!
- Also called bypassing.
Example:
  add $, $2, $3
  lw $4, ($)
  sub $5, $2, $4
  sub $, $3, $
  sw $, ($5)
(Figure: pipeline diagram marking the stage outputs where the result can already be obtained.)

Sol. of data hazard II: Forwarding
Take the values, wherever they are!
  add $, $2, $3
  lw $4, ($)
  sub $5, $2, $4
  sub $, $3, $
  sw $, ($5)
10 cycles! CPI == 2 (Not optimal, but much better!)
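The difference between CPI == 3 with stalls and CPI == 2 with forwarding can be reproduced with a small timing model. Everything here is an assumption for illustration: the model is a simplified in-order 5-stage pipeline with a write-early, read-late register file, and the register numbers in the sample program are hypothetical, since the operands of the slide's example are not fully legible.

```python
def total_cycles(program, forwarding):
    """program: list of (op, dest, sources); op == "lw" marks a load.

    Without forwarding, an instruction waits in ID until every producer is in WB.
    With forwarding, it waits only until the producer's result exists:
    after EX for ALU operations, after MEM for loads.
    """
    gate = 2 if forwarding else 1   # cycle before the first instruction's EX / ID
    tail = 2 if forwarding else 3   # stages remaining after EX / ID
    usable = {}                     # register -> earliest cycle a consumer may use it
    for op, dest, sources in program:
        gate = max([gate + 1] + [usable.get(src, 0) for src in sources])
        if dest is not None:
            if forwarding:
                usable[dest] = gate + (2 if op == "lw" else 1)
            else:
                usable[dest] = gate + 3
    return gate + tail


# Hypothetical program with three back-to-back dependences (assumption).
program = [
    ("add", 1, [2, 3]),
    ("lw", 4, [1]),
    ("sub", 5, [2, 4]),
    ("sub", 6, [3, 1]),
    ("sw", None, [6, 5]),
]

for forwarding in (False, True):
    cycles = total_cycles(program, forwarding)
    print(forwarding, cycles, cycles / len(program))
```

For this program the model gives 15 cycles without forwarding and 10 with it. The one remaining stall is the load followed immediately by its consumer.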

When can/should we forward data?
- When the instruction entering the EX stage consumes a result from a previous instruction that is entering the MEM stage or the WB stage.
- A source of the instruction entering the EX stage is the destination of an instruction entering the MEM/WB stage.
- The previous instruction must be an instruction that updates the register file.
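These conditions translate almost directly into a selection function for one ALU input. The sketch below is an illustration, not the slides' hardware: the names are made up, the check that skips register 0 relies on $zero being hardwired in MIPS, and preferring the EX/MEM value over the MEM/WB value is an assumption that the most recent producer should win.

```python
from dataclasses import dataclass


@dataclass
class PipelineReg:
    """The fields of a pipeline register that the forwarding unit looks at."""
    reg_write: bool  # does the instruction held here update the register file?
    rd: int          # its destination register number


def forward_select(src, ex_mem, mem_wb):
    """Choose where one ALU source operand of the instruction in EX comes from."""
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == src:
        return "EX/MEM"    # result of the instruction one ahead
    if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == src:
        return "MEM/WB"    # result of the instruction two ahead
    return "REGFILE"       # no hazard: use the value read in ID


ex_mem = PipelineReg(reg_write=True, rd=5)
mem_wb = PipelineReg(reg_write=True, rd=4)
print(forward_select(5, ex_mem, mem_wb), forward_select(4, ex_mem, mem_wb))
```

The function would be evaluated twice per instruction, once for each source operand, which corresponds to the ForwardA and ForwardB selects in the datapath figures.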

Forwarding in hardware
(Figure: the pipelined datapath with a forwarding unit driving the ForwardA and ForwardB multiplexers at the ALU inputs. The forwarding unit compares Rs and Rt of the current instruction (Ins#2) with the destination of the previous instruction (Ins#1), uses the control signals of both, and forwards the ALU result of Ins#1.)
How about load?

Forwarding in hardware
(Figure: the same datapath extended so that the forwarding unit also receives the control signals, Rd, and the ALU/MEM result of Ins#1 from the later pipeline register.)

There is still a case where we have to stall...
Revisit the following code:
  add $, $2, $3
  lw $4, ($)
  sub $5, $2, $4
  sub $, $3, $
  sw $, ($5)
- lw generates its result at the MEM stage, so we have to stall.
- If the instruction entering the EX stage depends on a load instruction that has not finished its MEM stage yet, we have to stall!
- We call this hazard detection. We need to know the following:
  1. If the instruction in EX/MEM updates a register (RegWrite)
  2. If the instruction in EX/MEM reads memory (MemRead)
  3. If the destination register of EX/MEM is a source of ID/EX (rs or rt of ID/EX == rt of EX/MEM)
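The three checks above combine into a single predicate. This is a sketch under assumptions: the parameter names are illustrative, and the extra test that ignores register 0 is an addition based on $zero being hardwired in MIPS.

```python
def load_use_stall(load_reg_write, load_mem_read, load_rt, consumer_rs, consumer_rt):
    """Return True if the consumer must wait one cycle for a preceding load.

    load_*     describe the earlier instruction (the potential load)
    consumer_* are the source registers of the dependent instruction
    """
    return (
        load_reg_write                              # 1. it updates a register
        and load_mem_read                           # 2. it reads memory (a load)
        and load_rt != 0                            # assumption: $zero never causes a hazard
        and load_rt in (consumer_rs, consumer_rt)   # 3. its destination is a source
    )


# Hypothetical register numbers: a load into register 4 followed by a user of 4.
print(load_use_stall(True, True, 4, 2, 4))
```

When the predicate is true, the hazard detection unit holds the PC and the earlier pipeline registers for one cycle and inserts a bubble. After that cycle the loaded value can be forwarded.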

Hazard detection with forwarding
(Figure: the pipelined datapath with both units: the hazard detection unit, which reads ID/EX.MemRead and drives PCWrite and IF/IDWrite, and the forwarding unit, which drives ForwardA and ForwardB.)

Control hazard

Control hazard
The processor cannot determine the next PC to fetch.
LOOP: lw $t3, ($s)
      addi $t, $t,
      add $v, $v, $t3
      addi $s, $s, 4
      bne $t, $t, LOOP
      (stall)
      lw $t3, ($s)
7 cycles per loop

Reducing the overhead of control hazards

Solution I: Delayed branches
- An agreement between the ISA and the hardware.
  - Branch delay slots: the next N instructions after a branch are always executed.
  - The compiler decides which instructions go in the branch delay slots.
  - Reordering the instructions cannot affect the correctness of the program.
- MIPS has one branch delay slot.
- Good: simple hardware.
- Bad: N cannot change, and sometimes the compiler cannot find good candidates for the slot.

Solution I: Delayed branches
Original code:
LOOP: lw $t3, ($s)
      addi $t, $t,
      add $v, $v, $t3
      addi $s, $s, 4
      bne $t, $t, LOOP
With the branch delay slot filled:
LOOP: lw $t3, ($s)
      addi $t, $t,
      add $v, $v, $t3
      bne $t, $t, LOOP
      addi $s, $s, 4      (branch delay slot)
      lw $t3, ($s)
6 cycles per loop

Solution II: always predict not-taken
Always predict that the next PC is PC+4.
LOOP: lw $t3, ($s)
      addi $t, $t,
      add $v, $v, $t3
      addi $s, $s, 4
      bne $t, $t, LOOP
      sw $v, ($s)          (fetched incorrectly, flushed into nops)
      add $t4, $t3, $t5    (fetched incorrectly, flushed into nops)
      lw $t3, ($s)
- If the branch is not taken: no stalls!
- If the branch is taken: it doesn't hurt! Flush the instructions that were fetched incorrectly.
7 cycles per loop

Solution III: always predict taken
(Figure: the pipelined datapath with the hazard detection unit and the forwarding unit.)

Solution III: always predict taken
(Figure: the same datapath with the shift-left-2 unit and the branch target adder moved earlier in the pipeline.)
Still have to stall 1 cycle.

Solution III: always predict taken
(Figure: the same datapath with a Branch Target Buffer added next to the PC.)
Consult the BTB in the fetch stage.

Branch Target Buffer
(Figure: the PC is looked up in the Branch Target Buffer; each entry maps a branch PC to its target address or target instruction.)
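As a rough software analogy, a Branch Target Buffer behaves like a small lookup table keyed by the branch PC. The sketch below makes several simplifying assumptions: it uses an unbounded dictionary instead of a fixed-size tagged table, it removes an entry when its branch is not taken, and the addresses are made up.

```python
class BranchTargetBuffer:
    """Maps the PC of a branch to the target address of that branch."""

    def __init__(self):
        self.entries = {}

    def predict_next_pc(self, pc):
        # Hit: predict taken and fetch from the stored target.
        # Miss: fall through to the next sequential instruction.
        return self.entries.get(pc, pc + 4)

    def update(self, pc, taken, target):
        # Called once the branch outcome is known.
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)  # assumption: forget branches that fall through


btb = BranchTargetBuffer()
first = btb.predict_next_pc(0x400010)    # not seen yet: PC + 4
btb.update(0x400010, taken=True, target=0x400000)
second = btb.predict_next_pc(0x400010)   # now predicts the loop target
print(hex(first), hex(second))
```

Because the lookup needs only the PC, it can happen in the fetch stage, before the instruction has even been decoded as a branch.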

Solution III: always predict taken
Always predict taken, with the help of the BTB.
LOOP: lw $t3, ($s)
      addi $t, $t,
      add $v, $v, $t3
      addi $s, $s, 4
      bne $t, $t, LOOP
      lw $t3, ($s)
      addi $t, $t,
      add $v, $v, $t3
5 cycles per loop (CPI == 1!!!)
But what if the branch is not always taken?

Dynamic branch prediction

1-bit counter
- Predict that this branch will go the same way as it did the last time it executed.
- 1 for taken, 0 for not taken.
(Figure: each Branch Target Buffer entry is extended with a 1-bit counter; the entry matching the current PC predicts taken.)
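A last-outcome predictor of this kind takes only a few lines to simulate. The branch pattern below, a loop branch taken 9 times and then not taken, is a hypothetical input chosen for illustration, as is the initial prediction of taken.

```python
def one_bit_accuracy(outcomes, initial=True):
    """Predict that each branch goes the same way as its previous execution."""
    last = initial
    correct = 0
    for taken in outcomes:
        if last == taken:
            correct += 1
        last = taken  # remember only the most recent outcome
    return correct / len(outcomes)


loop_run = [True] * 9 + [False]   # hypothetical: 9 taken iterations, then loop exit
print(one_bit_accuracy(loop_run))
print(one_bit_accuracy(loop_run * 2))
```

When the loop runs a second time, the single bit still remembers the exit, so this predictor misses twice per run: once on the exit and once on re-entry.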

2-bit counter
- A 2-bit counter for each branch.
- Predict taken if the counter value >= 2.
- If the prediction is in one of the taken states, fetch from the target PC; otherwise, use PC+4.
(Figure: state diagram of the 2-bit counter. States 3 and 2 predict taken and the two lower states predict not taken; a taken branch moves the counter up and a not-taken branch moves it down. Each Branch Target Buffer entry carries its own counter, and the entry matching the current PC predicts taken.)
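The state machine can be written as a saturating counter. This sketch assumes the counter starts at 2 and reuses a hypothetical loop branch that is taken 9 times and then not taken; neither value comes from the slide.

```python
class TwoBitCounter:
    """Saturating counter: values 2 and 3 predict taken, 0 and 1 predict not taken."""

    def __init__(self, state=2):  # starting value is an assumption
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(outcomes):
    counter = TwoBitCounter()
    correct = 0
    for taken in outcomes:
        if counter.predict() == taken:
            correct += 1
        counter.update(taken)
    return correct / len(outcomes)


loop_run = [True] * 9 + [False]   # hypothetical loop branch pattern
print(accuracy(loop_run * 2))
```

A single not-taken outcome only moves the counter from 3 to 2, so the next run of the loop is still predicted taken and only the loop exit is mispredicted.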

Performance of 2-bit counter
2-bit state machine for each branch:
  for(i = ; i < ; i++) {
    sum += a[i];
  }
- The predictor predicts taken (T) on every iteration. The prediction is correct for the early iterations and for i = 2, 3, and 4-9; on the final iteration it predicts T but the branch is not taken (NT).
- 90% accuracy!
Application: 80% ALU, 20% branch, and branches resolved in the EX stage. Average CPI?
  1 + 20% * (1 - 90%) * 2 = 1.04
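The CPI estimate above follows a general pattern: base CPI plus the branch fraction, times the misprediction rate, times the misprediction penalty. A small helper makes the terms explicit; the function and parameter names are illustrative.

```python
def average_cpi(branch_fraction, accuracy, penalty_cycles, base_cpi=1.0):
    """Average CPI when only mispredicted branches add stall cycles."""
    misprediction_rate = 1.0 - accuracy
    return base_cpi + branch_fraction * misprediction_rate * penalty_cycles


# Numbers from the slide: 20% branches, 90% accuracy, 2-cycle penalty
# because the branch is resolved in the EX stage.
cpi = average_cpi(branch_fraction=0.20, accuracy=0.90, penalty_cycles=2)
print(round(cpi, 2))
```

The same helper shows how sensitive a deeper pipeline is to prediction accuracy, since the penalty grows with the number of stages before the branch is resolved.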

Make the prediction better
Consider the following code:
  i = 0;
  do {
    if( i % 3 != 0 )    // Branch Y, taken if i % 3 == 0
      a[i] *= 2;
    a[i] += i;
  } while ( ++i < )     // Branch X
Can we capture the pattern?
  i   branch   result
  0   Y        T
  0   X        T
  1   Y        NT
  1   X        T
  2   Y        NT
  2   X        T
  3   Y        T
  3   X        T
  4   Y        NT
  4   X        T
  5   Y        NT
  5   X        T
  6   Y        T
  6   X        T
  7   Y        NT

Predict using history
- Instead of using the PC to choose the predictor, use a bit vector (global history register, GHR) made up of the previous branch outcomes.
- Each entry in the history table has its own counter.
(Figure: an n-bit GHR indexes a history table with 2^n entries. For example, the history (T, NT, T) gives index 101, and the selected counter predicts taken.)

Performance of global history predictor
Consider the following code:
  i = 0;
  do {
    if( i % 3 != 0 )    // Branch Y, taken if i % 3 == 0
      a[i] *= 2;
    a[i] += i;
  } while ( ++i < )     // Branch X
Assume that we start with a 4-bit GHR = 0000 and that all counters are 10.
  i    branch   prediction   actual
  0    Y        T            T
  0    X        T            T
  1    Y        T            NT
  1    X        T            T
  2    Y        T            NT
  2    X        T            T
  3    Y        T            T
  3    X        T            T
  4    Y        T            NT
  4    X        T            T
  5    Y        NT           NT
  5    X        T            T
  6    Y        T            T
  6    X        T            T
  7    Y        NT           NT
  7    X        T            T
  8    Y        NT           NT
  8    X        T            T
  9    Y        T            T
  9    X        T            T
  10   Y        NT           NT
Nearly perfect after i = 4.
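The prediction column can be checked with a short simulation of a global history predictor. The starting values are assumptions: a 4-bit GHR of 0000 and every 2-bit counter at 2 (binary 10), chosen because they reproduce the predictions listed above. Branch X is assumed to be taken throughout the simulated window.

```python
def simulate(iterations=10, ghr_bits=4, initial_counter=2):
    """Return the list of (i, branch) pairs that were mispredicted."""
    mask = (1 << ghr_bits) - 1
    ghr = 0                                           # global history register
    table = [initial_counter] * (1 << ghr_bits)       # one 2-bit counter per history
    mispredicted = []
    for i in range(iterations):
        # Branch Y is taken when i % 3 == 0; branch X (loop back) is taken.
        for branch, taken in (("Y", i % 3 == 0), ("X", True)):
            predicted_taken = table[ghr] >= 2
            if predicted_taken != taken:
                mispredicted.append((i, branch))
            # Update the selected counter, then shift the outcome into the history.
            if taken:
                table[ghr] = min(3, table[ghr] + 1)
            else:
                table[ghr] = max(0, table[ghr] - 1)
            ghr = ((ghr << 1) | int(taken)) & mask
    return mispredicted


print(simulate())
```

Under these assumptions the only mispredictions are branch Y at i = 1, 2, and 4. After that, each recurring history pattern selects a counter that has already learned the outcome that follows it.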

Branch prediction and modern processors

Deeper pipeline
- Higher frequencies by shortening the pipeline stages.
- Higher marketing value, since consumers usually link performance with frequency.
- Potentially higher power consumption, as dynamic/active power = aCV^2f.
- If the execution time is better, the processor can still consume less energy, since energy = power x time.

Case Study

Intel Pentium 4 Microarch.

Intel Pentium 4
- Very deep pipeline, in order to achieve high frequency! (starting from 1.5GHz)
- 20 stages in Netburst: 1-2 TC Nxt IP, 3-4 TC Fetch, 5 Drive, 6 Alloc, 7-8 Rename, 9 Que, 10-12 Sch, 13-14 Disp, 15-16 RF, 17 Ex, 18 Flgs, 19 Br Ck, 20 Drive
- 31 stages in Prescott
- 3W (3.6GHz, 65nm)
Reference: The Microarchitecture of the Pentium 4 Processor

AMD Athlon 64

AMD Athlon 64
- 12-stage pipeline
(Figure: the 12 pipeline stages, from instruction address and instruction byte pick, through decode, pack, dispatch, and scheduling, to execution, D-cache address, and D-cache access.)
- 89W TDP (Opteron 2.2GHz, 90nm)

Demo revisited
Why does sorting the array speed up the code despite the increased instruction count?
  if(option)
    std::sort(data, data + arraysize);
  for (unsigned i = ; i < ; ++i) {
    int threshold = std::rand();
    for (unsigned i = ; i < arraysize; ++i) {
      if (data[i] >= threshold)
        sum ++;
    }
  }
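One way to see the effect is to feed the outcomes of the `data[i] >= threshold` branch into a 2-bit counter, once for unsorted data and once for sorted data. This is a simplified model, not a measurement of real hardware: the data size, value range, fixed threshold, random seed, and initial counter value are all assumptions made for illustration.

```python
import random


def two_bit_accuracy(outcomes):
    """Accuracy of a single 2-bit saturating counter on a branch outcome stream."""
    state = 2          # assumed starting value
    correct = 0
    for taken in outcomes:
        if (state >= 2) == taken:
            correct += 1
        state = min(3, state + 1) if taken else max(0, state - 1)
    return correct / len(outcomes)


rng = random.Random(0)                      # fixed seed for repeatability
data = [rng.randrange(256) for _ in range(10_000)]
threshold = 128                             # hypothetical fixed threshold

unsorted_acc = two_bit_accuracy([x >= threshold for x in data])
sorted_acc = two_bit_accuracy([x >= threshold for x in sorted(data)])
print(unsorted_acc, sorted_acc)
```

With sorted data the branch outcome changes direction only once, so the predictor is almost always right. With unsorted data the outcome is close to a coin flip, and roughly half of the branches pay the misprediction penalty, which outweighs the instructions spent on sorting.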

Deep pipelining and data hazards

Data hazard revisited
How many cycles does it take to execute the following code? Draw the pipeline execution diagram, assuming that we have full data forwarding.
  lw $t, ($a)
  lw $a, ($t)
  bne $a, $zero, EX
9 cycles

Intel's latest Skylake
(Figure: Skylake block diagram.)
- Front end: BPU, 32K L1 Instruction Cache, MSROM (4 uops/cycle), Decoded Icache (DSB) (6 uops/cycle), Legacy Decode Pipeline (5 uops/cycle), Instruction Decode Queue (IDQ, or micro-op queue)
- Allocate/Rename/Retire/MoveElimination/ZeroIdiom
- Scheduler
  - Port 0: Int ALU, Vec FMA, Vec MUL, Vec Add, Vec ALU, Vec Shft, Divide, Branch2
  - Port 1: Int ALU, Fast LEA, Vec FMA, Vec MUL, Vec Add, Vec ALU, Vec Shft, Int MUL, Slow LEA
  - Port 5: Int ALU, Fast LEA, Vec SHUF, Vec ALU, CVT
  - Port 6: Int ALU, Int Shft, Branch1
  - Port 2: LD/STA
  - Port 3: LD/STA
  - Port 4: STD
  - Port 7: STA
- 256K L2 Cache (Unified), 32K L1 Data Cache
Good reference for Intel microarchitectures: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf