PIPELINES AND ILP
Steven R. Bagley
Photo: David Wright, https://www.flickr.com/photos/dhwright/3312563248

INTRODUCTION
We have been considering what makes the CPU run at a particular speed.
We spent the last two weeks looking at memory latency, and at how caching can help speed things up by reducing the time to fetch instructions and data.
Today we look at other tricks used by CPU designers to make the CPU run fast.

[Figure: FETCH, DECODE and EXECUTE stages repeated along the clock-cycle axis.]
The fetch-decode-execute cycle sets the minimum time any instruction will take to run at 3 cycles (one cycle for each stage).

[Figure: pipeline diagram against clock cycles. Three rows, keyed FETCH, DECODE and EXECUTE, show INST 1 onwards moving through the stages, a new instruction entering each cycle.]
Once we get going, we finish execution of an instruction every clock cycle.
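The timing in the diagram can be checked with a little arithmetic. The sketch below is only an illustration, assuming an ideal three-stage pipeline with no hazards; the function names are made up for this example.

```python
def unpipelined_cycles(n_instructions, n_stages=3):
    """Each instruction runs all its stages before the next one starts."""
    return n_stages * n_instructions


def pipelined_cycles(n_instructions, n_stages=3):
    """The first instruction needs n_stages cycles to fill the pipeline;
    after that one instruction completes every cycle (assuming no stalls)."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def pipeline_diagram(n_instructions, stages=("FETCH", "DECODE", "EXECUTE")):
    """Return one text row per stage; each column is one clock cycle."""
    total = pipelined_cycles(n_instructions, len(stages))
    rows = []
    for s, name in enumerate(stages):
        cells = []
        for cycle in range(total):
            inst = cycle - s  # 0-based index of the instruction in stage s
            if 0 <= inst < n_instructions:
                cells.append(f"I{inst + 1:<3}")
            else:
                cells.append("    ")  # stage is empty in this cycle
        rows.append(f"{name:<8}" + "".join(cells))
    return rows


if __name__ == "__main__":
    print("\n".join(pipeline_diagram(5)))
    print(unpipelined_cycles(10), "cycles unpipelined,",
          pipelined_cycles(10), "cycles pipelined")
```

For ten instructions this gives 30 cycles without pipelining against 12 with it: two cycles to fill the pipeline, then one instruction finishing per cycle.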

BUBBLES
Pipeline hazards introduce bubbles into the pipeline.
These are points in time where the CPU isn't executing an instruction because the hazard forced the delay of an earlier stage.
Also known as a pipeline stall.
The size of the bubble depends on the instructions.
In the worst case, we can end up with instructions effectively executing serially.
We can rewrite our code to be more pipeline friendly by reordering the instructions.

CONTROL HAZARD
Branches cause another form of pipeline hazard: a control hazard (also known as a branch hazard).
It occurs when the proper instruction cannot execute in the next clock cycle because a different instruction was fetched.
With a conditional branch, we cannot know until the branch is executed whether we'll get a control hazard.
You might have fetched the right instruction, you might not.

CONTROL HAZARD
In our case, the unconditional branch means we definitely haven't fetched the correct instruction.
We need to discard the currently fetched and decoded instructions and start again.
This causes a stall one cycle shorter than the pipeline (2 cycles for our 3-stage pipeline).
It is not just branches: any instruction which alters the program counter (PC) has the same effect.

[Figure: pipeline diagram against clock cycles (entries B _cmp, CMP, BLE, SWI 4, SWI 4, SWI 2), shown beside the program listing.]

    0x00              B     main
    0x04  _a          DEFW  163
    0x08  _b          DEFW  173
    0x0C  main        LDR   R0, _a
    0x10              LDR   R1, _b
                      ; euclid routine goes here
    0x14              B     _cmp
    0x18  loop        CMP   R0, R1
    0x1C              BLE   skiptoelse
    0x20              SUB   R0, R0, R1
    0x24              B     end
    0x28  skiptoelse  SUB   R1, R1, R0
    0x2C  end
          _cmp        CMP   R0, R1
    0x30              BNE   loop
                      ; R0 (and R1) contain result
    0x34              SWI   4
    0x38              SWI   2

We fetch the correct instruction (a CMP R0,R1). In this case it happens to be the same instruction as the one fetched in error, but it could be any instruction.
The pipeline then continues as before until we reach another branch, when the same thing happens, although in this case it is a conditional branch, so we might be in a position where the condition matches.

MITIGATING CONTROL HAZARD
Control hazards introduce a bubble that is one stage less than the length of the pipeline.
In this case, the pipeline is 3 stages long, so the stall is 2 cycles long.
It's possible to design the CPU instruction set to mitigate this in some circumstances, for example by always executing the instruction after the branch.
This can lead to some interesting instruction sets.
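To put a number on the cost of these bubbles, here is a minimal sketch. It assumes a very simple model that is not taken from any real CPU: fetch always proceeds sequentially, every taken branch is only resolved in the last stage, and each one discards (stages - 1) cycles of work. The function names are hypothetical.

```python
def branch_penalty(n_stages):
    """Bubble length for a taken branch resolved in the final stage:
    everything fetched behind it (n_stages - 1 instructions) is discarded."""
    return n_stages - 1


def cycles_with_flushes(n_instructions, taken_branches, n_stages=3):
    """Ideal pipelined time plus one flush per taken branch."""
    if n_instructions == 0:
        return 0
    ideal = n_stages + n_instructions - 1
    return ideal + taken_branches * branch_penalty(n_stages)


if __name__ == "__main__":
    # 10 instructions, 2 of which are taken branches, on our 3-stage pipeline
    print(cycles_with_flushes(10, 2))               # 12 ideal + 2 * 2 = 16
    # The same instruction stream on a much deeper (20-stage) pipeline
    print(cycles_with_flushes(10, 2, n_stages=20))  # 29 ideal + 2 * 19 = 67
```

Under this model the penalty grows linearly with pipeline depth, which is why deeper pipelines care so much more about avoiding mis-fetched branches.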

CONDITIONAL INSTRUCTIONS
ARM designers took a different approach.
They realised that some branches only happen in order to skip one or two instructions, as in our Euclid example.
So they decided to make every instruction conditional (not just branches).
Any ARM instruction can have a condition code placed on it, and the instruction is only executed if the condition is met.
This means we only have a one-cycle bubble (in the execute phase) for each skipped instruction.
Our Euclid example can be rewritten in three lines using this.
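One plausible form of the rewritten loop body is CMP R0,R1 / SUBGT R0,R0,R1 / SUBLT R1,R1,R0, followed by the existing BNE loop. That sequence is an assumption about what was shown, not a copy of it. The Python sketch below models it: each predicated SUB either executes or passes through as a skipped instruction, and no branch is needed inside the loop body.

```python
def cond_holds(cond, a, b):
    """Evaluate an ARM-style condition against the flags left by CMP a, b
    (treated as a signed comparison)."""
    return {"GT": a > b, "LT": a < b, "NE": a != b}[cond]


def euclid_conditional(r0, r1):
    """GCD by repeated subtraction, modelling the assumed sequence:

        loop  CMP   R0, R1
              SUBGT R0, R0, R1
              SUBLT R1, R1, R0
              BNE   loop

    Expects positive integers.
    Returns (gcd, number of conditional instructions that were skipped).
    """
    skipped = 0
    while True:
        a, b = r0, r1                 # CMP: flags reflect these values until the next CMP
        if cond_holds("GT", a, b):    # SUBGT R0, R0, R1
            r0 = r0 - r1
        else:
            skipped += 1              # passes through the pipeline without effect
        if cond_holds("LT", a, b):    # SUBLT R1, R1, R0
            r1 = r1 - r0
        else:
            skipped += 1
        if not cond_holds("NE", a, b):  # BNE loop falls through when R0 == R1
            return r0, skipped


if __name__ == "__main__":
    print(euclid_conditional(163, 173)[0])  # the values of _a and _b in the listing
```

Each skipped instruction costs a single cycle, compared with the two-cycle flush of a taken branch on the three-stage pipeline.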

PIPELINE LENGTH
Pipeline length depends on the implementation of the CPU.
For example, the MIPS CPU has a five-stage pipeline:
    IF:  Instruction fetch (from memory)
    ID:  Decode and read values from registers
    EX:  Execute operation or calculate address (i.e. use the ALU)
    MEM: Access operand in data memory
    WB:  Write back result into registers
The instruction set is designed to allow this to happen.

PIPELINE LENGTH
As the pipeline is broken down into smaller steps, each step does less and takes less time to run, so the clock can run faster.
But the cost of a stall (e.g. for a branch) becomes much greater.
The Pentium 4 had a 20-stage pipeline, so a branch stall would take several clock cycles.
More types of hazard can also appear.
Another common hazard is the data hazard.

DATA HAZARD
A data hazard occurs when an instruction needs a value that hasn't yet been calculated by a previous instruction.
Take the following ARM code:
    ADD R0,R1,R2
    SUB R2,R0,#5
The second instruction cannot begin executing until the value for R0 is calculated.
Now let's look at how this would play out in a MIPS-like pipeline.

[Figure: five-stage pipeline diagram against clock cycles.
    ADD R0, R1, R2    IF  ID  EX  MEM WB
    SUB R2, R0, #5        IF  ID  EX  MEM WB
Labels mark where R0 is updated by the ADD and where R0 is fetched by the SUB.]
The ADD instruction doesn't update the register R0 until its WB phase.
The SUB instruction needs the value of R0 before it has been updated.
So do we need to stall the CPU and shift the final stages of the SUB until after the WB?

MITIGATING DATA HAZARDS
We can use an approach called forwarding (or bypassing) to mitigate a data hazard.
Rather than have the instruction wait for the data to be written back, we provide a shortcut from the internal buffers in the CPU to supply the data, rather than needing to fetch it from the register.

[Figure: five-stage pipeline diagram against clock cycles.
    ADD R0, R1, R2    IF  ID  EX  MEM WB
    SUB R2, R0, #5        IF  ID  EX  MEM WB
Labels mark where the value of R0 is calculated by the ADD and where R0 is needed by the SUB.]
The ADD instruction calculates the value of R0 in its EX phase.
The SUB doesn't need it until its own EX phase, so we provide a shortcut in the CPU design to get the value into the right place.
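The effect of forwarding on the ADD/SUB pair can be worked through with a small model. This is a sketch under stated assumptions, not a description of any particular CPU: a five-stage IF/ID/EX/MEM/WB pipeline, an ALU result as the value in question (loads are not modelled), and a hypothetical flag saying whether the register file can be written and read in the same cycle.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stall_cycles(distance, forwarding, same_cycle_write_read=False):
    """Stall cycles for a consumer `distance` instructions after the producer
    (1 = the very next instruction).

    The producer enters IF in cycle 0, so it occupies stage s in cycle s.
    Without stalls the consumer occupies stage s in cycle distance + s.
    """
    if forwarding:
        # Result leaves the ALU at the end of the producer's EX and is
        # routed straight to the consumer's EX in the following cycle.
        ready = STAGES.index("EX") + 1
        needed = distance + STAGES.index("EX")
    else:
        # Result only reaches the register file in WB; the consumer reads
        # registers in ID.
        ready = STAGES.index("WB")
        if not same_cycle_write_read:
            ready += 1  # must wait until the cycle after the write
        needed = distance + STAGES.index("ID")
    return max(0, ready - needed)


if __name__ == "__main__":
    # ADD R0,R1,R2 immediately followed by SUB R2,R0,#5
    print("no forwarding:", stall_cycles(1, forwarding=False))
    print("forwarding:   ", stall_cycles(1, forwarding=True))
```

With these assumptions the back-to-back pair stalls for three cycles without forwarding (two if the register file allows a write and a read in the same cycle) and not at all with it.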

PIPELINES
All pipeline stages must take the same amount of time to complete.
Or rather, the longest step defines the time each step of the pipeline takes to run.
It doesn't matter if a step completes early.
We can design our instruction set to help with this.

DESIGNING INSTRUCTION SETS FOR PIPELINE
It helps if all instructions are the same length, since the instruction fetch then always takes the same amount of time.
It also helps if there is regularity in the bit patterns used to express instructions, e.g. the bits for a register are in the same place for each instruction.
Separating memory access from other instructions helps too.
Compare ARM, where each instruction is 4 bytes, with x86, where instructions vary from 1 to 15 bytes (and the length isn't known until you start decoding the instruction).
Modern CPUs translate x86 instructions into RISC-like instructions internally.
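As a small illustration of that regularity, the sketch below pulls the condition, opcode and register fields out of two classic 32-bit ARM data-processing instructions using the same shifts and masks for both. The function name is made up for this example, and it only covers the data-processing format, not the whole instruction set.

```python
def decode_data_processing(word):
    """Extract fixed-position fields from a 32-bit ARM data-processing
    instruction word. The field positions are the same for every
    instruction in this format, so no instruction-specific logic is needed."""
    return {
        "cond":   (word >> 28) & 0xF,  # bits 31..28: condition code
        "opcode": (word >> 21) & 0xF,  # bits 24..21: operation (ADD, SUB, ...)
        "rn":     (word >> 16) & 0xF,  # bits 19..16: first operand register
        "rd":     (word >> 12) & 0xF,  # bits 15..12: destination register
    }


if __name__ == "__main__":
    add = decode_data_processing(0xE0810002)  # ADD R0, R1, R2
    sub = decode_data_processing(0xE2402005)  # SUB R2, R0, #5
    print(add)
    print(sub)
```

Because every instruction is one 32-bit word, the fetch stage can also compute the address of the next instruction as the current address plus 4 without decoding anything first.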

BRANCH PREDICTION
Control hazards happen when the CPU has started to fetch the wrong instruction.
Instructions pass through the early stages of the pipeline but are not needed, so the work gets thrown away.
The CPU then has to start again and fetch the correct instruction.

[Figure: pipeline diagram against clock cycles (entries B _cmp, CMP, BLE, SWI 4, SWI 4, SWI 2, CMP), shown beside the Euclid program listing.]
Here we had started to fetch SWI 4 and SWI 2 after our branch, which we don't need.
So the next fetch has to get the correct CMP instruction, causing a stall.

BRANCH PREDICTION
Our CPU is using a very naive approach to fetching the next instruction: it always fetches the next one linearly in memory.
But with loops this is almost always going to be the wrong instruction.
A loop will usually run several times, and only on the last iteration does the next instruction in memory get executed.
Surely it'd make more sense to assume the branch was taken?
This makes the pipeline construction more complex, but it is doable.

[Figure: pipeline diagram against clock cycles (entries B _cmp, CMP, BLE, BLE, BLE), shown beside the Euclid program listing.]
Assuming the branch is taken, rather than fetching SWI 4 here we'd want to fetch CMP R0,R1 and then BLE, and move these up the stages.
But what do we fetch here?

BRANCH PREDICTION
It is relatively easy to predict which way a loop will branch (i.e. to loop).
However, for branches used to implement a conditional statement it is much harder.
Which is the best path to take by default?
We need the CPU to be able to predict which way the branch will go.

BRANCH PREDICTION
The CPU uses the past to predict how a branch will be taken.
It keeps track of how many times it branched and how many times it didn't, for the branch instructions it has seen recently.
It uses these statistics to work out which instruction is the best one to predict.
This requires considerable logic to implement.
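One common textbook way of keeping such statistics is a two-bit saturating counter per branch. The sketch below uses that scheme purely as an illustration; it is not claimed to be what any particular CPU does, and the class and function names are invented for the example.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per branch address.

    Counter values 0 and 1 predict 'not taken'; 2 and 3 predict 'taken'.
    Unseen branches start at 1 (weakly not taken), an arbitrary choice.
    """

    def __init__(self):
        self.counters = {}  # branch address -> counter in 0..3

    def predict(self, addr):
        return self.counters.get(addr, 1) >= 2

    def update(self, addr, taken):
        c = self.counters.get(addr, 1)
        self.counters[addr] = min(3, c + 1) if taken else max(0, c - 1)


def count_mispredictions(predictor, outcomes, addr=0x30):
    """Feed a sequence of actual outcomes (True = taken) for one branch
    and count how often the prediction was wrong."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(addr) != taken:
            misses += 1
        predictor.update(addr, taken)
    return misses


if __name__ == "__main__":
    # A loop-closing branch such as BNE loop: taken 9 times, then falls through.
    loop_run = [True] * 9 + [False]
    p = TwoBitPredictor()
    print(count_mispredictions(p, loop_run))  # first time through the loop
    print(count_mispredictions(p, loop_run))  # same loop executed again
```

On the first run the predictor is wrong on the first iteration and on the loop exit; on the second run only the exit is mispredicted, because a single not-taken outcome is not enough to flip a saturated counter.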

SPECULATIVE EXECUTION
Branch prediction is an example of speculative execution.
The CPU is doing some work on the assumption that it'll probably be needed, but it might also end up being thrown away.
Depending on the pipeline design, this could get as far as actually calculating results.

INSTRUCTION-LEVEL PARALLELISM
Pipelining speeds up the CPU by enabling many instructions to execute at once.
This is known as instruction-level parallelism (ILP).
It is largely invisible to the programmer, although if you know how things work you can construct code to benefit from it.
But we are limited in the amount of parallelism we can exploit, due to the structure of the CPU data path.

We also saw how the data flows through the CPU. Highlight how the data flows.

SUPERSCALAR
But what if we built the CPU with more than one ALU?
The CPU could perform two additions at the same time, and so would be able to execute two instructions at the same time.
A CPU designed like this is described as superscalar.
It can get the time taken to execute an instruction down to less than one clock cycle, for certain instructions.

SUPERSCALAR
The CPU fetches two instructions in one clock cycle.
The CPU decodes two instructions in one clock cycle.
The CPU executes two instructions in one clock cycle.
The result is that each instruction appears to complete in 0.5 clock cycles, where possible.
It is not possible if the second instruction depends on the output of the first, or if the first is a branch.
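Those two pairing rules are enough to estimate how many issue cycles a short instruction sequence needs on an in-order, two-wide CPU. The sketch below checks only the two conditions named above; a real design checks more (for example whether both instructions need the same execution unit). The `Inst` record and function names are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Inst(NamedTuple):
    text: str                 # assembly text, for display only
    dest: Optional[str]       # register written, if any
    srcs: Tuple[str, ...]     # registers read
    is_branch: bool = False


def can_pair(first, second):
    """True if `second` may issue in the same cycle as `first`."""
    if first.is_branch:
        return False  # the instruction after a branch may not be wanted
    if first.dest is not None and first.dest in second.srcs:
        return False  # second needs the result of first
    return True


def issue_cycles(program):
    """In-order dual issue: take two instructions per cycle when allowed,
    otherwise one."""
    cycles = 0
    i = 0
    while i < len(program):
        if i + 1 < len(program) and can_pair(program[i], program[i + 1]):
            i += 2
        else:
            i += 1
        cycles += 1
    return cycles


if __name__ == "__main__":
    dependent = [
        Inst("ADD R0,R1,R2", "R0", ("R1", "R2")),
        Inst("SUB R2,R0,#5", "R2", ("R0",)),
    ]
    independent = [
        Inst("ADD R0,R1,R2", "R0", ("R1", "R2")),
        Inst("ADD R3,R4,R5", "R3", ("R4", "R5")),
    ]
    print(issue_cycles(dependent), issue_cycles(independent))
```

The dependent ADD/SUB pair from the data hazard example takes two issue cycles, while two independent ADDs share one.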

APPLE A8 CPU
Taken from the analysis at http://www.anandtech.com/show/7910/apples-cyclone-microarchitecture-detailed
There are several different data paths that instructions can take through the CPU.
They are not all equal; it is up to the control logic to make sure each instruction follows the correct path.

IN-ORDER
The CPU we have considered would be described as being in-order: it executes the instructions in the order they appear in memory.
The program needs to be written to ensure a superscalar CPU can execute the instructions in parallel.
It is up to the programmer or compiler to design the code carefully to get the best order.
The problem is that the best order varies from CPU implementation to implementation.
This works OK in some applications.

Original order:
    LDR R0,_a
    LDR R1,_b
    ADD R0,R0,#5
Reordered:
    LDR R0,_a
    ADD R0,R0,#5
    LDR R1,_b
We saw a situation like this earlier on our simple CPU.
It caused a stall because we can't fetch the ADD instruction until after LDR R1,_b has completed executing.
But if we reorder the instructions, we get the same effect and reduce the stall to one cycle.

OUT-OF-ORDER EXECUTION
Some CPUs, however, go one step further.
They will reorder the instructions themselves to execute them in the best manner for the CPU design.
This is known as out-of-order execution.
Lots of tricks are used to implement this, e.g. register renaming.

MULTI-CORE
These kinds of tricks can only get us so far, and they require a lot of logic to implement.
The alternative is to have lots of separate CPU cores and rewrite our programs to run in parallel.
But that brings its own issues.