COSC4201 Pipelining. Prof. Mokhtar Aboelaze York University

Instructions: Fetch

Every instruction can be executed in 5 cycles (on a MIPS-like machine). The five cycles are:

Instruction fetch (IF):
  IR ← Mem[PC]
  NPC ← PC + 4

In this stage we fetch the instruction from memory and increment the program counter to point to the next instruction (all instructions in DLX are 32 bits long).

Instructions: Decode / Register Fetch

Instruction decode/register fetch (ID):
  A ← Regs[IR 6..10]
  B ← Regs[IR 11..15]
  Imm ← ((IR 16)^16 ## IR 16..31)   (the sign-extended immediate)

The instruction is also decoded. A and B are two temporary registers. We can read the registers this early because the locations of the two source operands are fixed in the instruction format, known before decoding finishes.

Instructions: EX / Effective Address

Memory reference:     ALUoutput ← A + Imm   (effective address)
Reg-Reg ALU:          ALUoutput ← A op B
Reg-Immediate ALU:    ALUoutput ← A op Imm
Branch:               ALUoutput ← NPC + Imm;  Cond ← (A op 0)

Instruction: Memory Access / Branch

Memory reference:     LMD ← Mem[ALUoutput]  or  Mem[ALUoutput] ← B
Branch:               if (Cond) PC ← ALUoutput else PC ← NPC

Instruction: Write Back

Reg-Reg ALU:    Regs[IR 16..20] ← ALUoutput
Reg-Imm ALU:    Regs[IR 11..15] ← ALUoutput
Load:           Regs[IR 11..15] ← LMD

Pipelining

We execute billions of instructions, so throughput is what matters. All instructions are the same length, registers are located in the same place in the instruction format, and memory operands appear only in loads and stores. Pipelining requires adding a set of registers, one between each pair of pipeline stages, to convey values and control information between the stages (Fig. 3.5).

Pipelining

(Figure: the pipelined datapath.)

Performance of Pipelining

Pipelining does not reduce the time to execute an individual instruction (it actually increases it); it increases throughput. We cannot skip stages anymore, and the pipeline cycle time equals the time of the longest stage.

Example: a machine with a 10-ns cycle takes 4 cycles for ALU and branch instructions and 5 cycles for memory instructions (frequencies 40%, 20%, 40%). What is the effect of pipelining?

  Without pipelining: execution time = 10 × (0.6×4 + 0.4×5) = 44 ns
  With pipelining: 10 + 1 (overhead) = 11 ns
  Speedup = 44 / 11 = 4

Performance of Pipelining

Pipelining is an implementation technique in which the execution of multiple instructions is overlapped. Instruction execution involves a number of steps, each of which completes a part of the instruction; each step is called a pipe stage or pipe segment. Instructions enter at the first stage and proceed until they exit the last stage. Throughput of the system is measured as the number of instructions completed per second. The time to move an instruction one step down the pipeline equals the machine cycle and is determined by the stage with the longest processing delay.
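The example's arithmetic can be checked with a short script (an illustrative sketch using the slide's numbers, not part of the original material):

```python
# Example from the slides: 10-ns cycle, 4 cycles for ALU and branch
# instructions (60% of the mix), 5 cycles for memory instructions (40%).
CYCLE_NS = 10

def unpipelined_time_ns():
    # Average instruction time without pipelining.
    return CYCLE_NS * (0.6 * 4 + 0.4 * 5)   # 44 ns

def pipelined_time_ns(overhead_ns=1):
    # With pipelining, one instruction completes per (stretched) cycle.
    return CYCLE_NS + overhead_ns           # 11 ns

def speedup():
    # Ratio of average instruction times: 44 / 11 = 4.
    return unpipelined_time_ns() / pipelined_time_ns()
```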

Performance of Pipelining

The length of the machine clock cycle is determined by the time required for the slowest pipe stage, so an important pipeline design consideration is to balance the length of the stages. If all stages are perfectly balanced, then the time per instruction on a pipelined machine (assuming ideal conditions with no stalls) is:

  Time per instruction (pipelined) = Time per instruction on unpipelined machine / Number of pipe stages

Under these ideal conditions the speedup from pipelining equals the number of pipeline stages n, one instruction completes every cycle, and CPI = 1.

Pipelining (DLX pipeline stages: IF = Instruction Fetch, ID = Instruction Decode, EX = Execution, MEM = Memory Access, WB = Write Back)

                    Clock cycle
Instruction         1    2    3    4    5    6    7    8    9
Instruction i       IF   ID   EX   MEM  WB
Instruction i+1          IF   ID   EX   MEM  WB
Instruction i+2               IF   ID   EX   MEM  WB
Instruction i+3                    IF   ID   EX   MEM  WB
Instruction i+4                         IF   ID   EX   MEM  WB

Cycles 1-4 fill the pipeline; the first instruction, i, completes in cycle 5 and the last instruction, i+4, completes in cycle 9.

Hazards

There are situations, called hazards, that prevent the continuous flow of instructions through the pipe:

Structural hazards: resource conflicts.
Data hazards: an instruction depends on the result of a previous instruction that is not ready yet.
Control hazards: branches (we don't know the address of the next instruction).

Performance with Hazards

Hazards may make it necessary to stall the pipeline by one or more cycles, degrading performance from the ideal CPI of 1:

  CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction

If pipelining overhead is ignored and we assume the stages are perfectly balanced, then:

  Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)

When all instructions take the same number of cycles, equal to the number of pipeline stages:

  Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)

Pipeline Performance (Hazards)

If we think of pipelining as improving the effective clock cycle time, then given the CPI of the unpipelined machine and an ideal pipelined CPI of 1, the speedup of a pipeline with stalls over the unpipelined case is:

  Speedup = [1 / (1 + Pipeline stall cycles per instruction)] × (Clock cycle unpipelined / Clock cycle pipelined)

When pipe stages are balanced with no overhead, the clock cycle of the pipelined machine is smaller by a factor equal to the pipeline depth:

  Clock cycle pipelined = Clock cycle unpipelined / Pipeline depth
  Pipeline depth = Clock cycle unpipelined / Clock cycle pipelined

  Speedup = [1 / (1 + Pipeline stall cycles per instruction)] × Pipeline depth
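The last relation can be written as a one-line function (an illustration, not from the slides):

```python
def pipeline_speedup(depth, stalls_per_instr):
    # Speedup = pipeline depth / (1 + pipeline stall cycles per instruction),
    # assuming balanced stages, no overhead, and ideal CPI = 1.
    return depth / (1 + stalls_per_instr)

# With no stalls a 5-stage pipeline reaches the ideal speedup of 5;
# half a stall cycle per instruction cuts that to 5 / 1.5.
ideal = pipeline_speedup(5, 0)
with_stalls = pipeline_speedup(5, 0.5)
```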

Structural Hazards

When we pipeline a machine, overlapped instruction execution requires pipelining of the functional units and duplication of resources, to allow all possible combinations of instructions in the pipeline. If a resource conflict arises because a hardware resource is required by more than one instruction in the same cycle, and one or more of those instructions cannot be accommodated, a structural hazard has occurred. One example is a single memory for both instructions and data.

Structural Hazard (single memory for instructions and data)

Time (clock cycles)
          C1      C2      C3      C4      C5      C6      C7
Load      Ifetch  Reg     ALU     DMem    Reg
Instr 1           Ifetch  Reg     ALU     DMem    Reg
Instr 2                   Ifetch  Reg     ALU     DMem    Reg
Instr 3                           Ifetch  Reg     ALU     DMem ...
Instr 4                                   Ifetch  Reg     ALU  ...

In cycle 4, the Load's DMem access and Instr 3's Ifetch both need the single memory.

(Figure: the pipelined datapath and control path — next-PC adder, instruction cache, register file, ALU, data cache, sign-extended immediate, and the IF/ID, ID/EX, EX/MEM, MEM/WB pipeline registers.)

Structural Hazard Example

Example: data references are 40% of the instructions, and the ideal CPI (ignoring hazards) is 1. Assume the machine with the structural hazard has a clock rate 1.05 times higher than that of the machine without the hazard. Is it faster?

  Average instruction time = CPI × Tc = (1 + 0.4 × 1) × (T ideal / 1.05) ≈ 1.3 × T ideal

So the machine without the structural hazard is faster, despite its slower clock. One solution is to provide a separate memory for instructions.
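The example's arithmetic, as a sketch (the 1.05 clock ratio and 40% data-reference frequency are the slide's assumptions):

```python
def avg_time_with_hazard(data_ref_frac=0.4, clock_ratio=1.05):
    # Single-memory machine: every data reference adds one stall cycle,
    # but its clock is clock_ratio times faster, so the cycle time is
    # T_ideal / clock_ratio. Result is in units of T_ideal.
    cpi = 1 + data_ref_frac * 1
    return cpi / clock_ratio

# ~1.33 x T_ideal: the hazard-free machine wins despite the slower clock.
```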

Data Hazards

Pipelining changes the relative timing of instructions by overlapping their execution. If the timing of read/write accesses to operands changes, the result may be incorrect execution. Example:

  DADD R1, R2, R3
  DSUB R4, R1, R5
  AND  R6, R1, R7
  OR   R8, R1, R9
  XOR  R10, R1, R11

All the instructions after DADD use its result, which is not ready until the WB stage. Without proper precautions, DSUB will read the old value of R1. The DSUB and AND instructions must be stalled for correct execution.

Data Hazards

The XOR instruction works fine, since it reads R1 in cycle 6 and the value is written in cycle 5. The OR instruction can be made correct if we assume that writes to the registers happen in the first half of the cycle and reads in the second half. The AND and DSUB instructions read R1 in a cycle before it is written — a problem. But is it a real problem?

Data Hazards (Forwarding)

Reading an operand before it is written is one thing; requesting an operand before it is produced is another. The result of DADD is written in CC5 and read by DSUB in CC3 — BUT the result of DADD is produced in CC3 and needed by DSUB in CC4. We can use forwarding:

1. The ALU result in the EX/MEM register is always fed back to the ALU inputs.
2. If the forwarding hardware detects that the previous ALU op wrote to a register that is a source of the current ALU op, a MUX selects the fed-back value instead of the register-file value.

We need to forward results not only from the previous instruction, but from any instruction that started up to three cycles earlier.
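A hedged sketch of the forwarding check just described (the function and its names are illustrative, not the book's hardware description): compare the destination registers sitting in the EX/MEM and MEM/WB pipeline registers against the current instruction's source register, preferring the most recent producer.

```python
def forward_select(src, exmem_rd, exmem_we, memwb_rd, memwb_we):
    """Choose what drives one ALU input: the EX/MEM bypass (result of
    the previous instruction), the MEM/WB bypass (two instructions
    back), or the register file. R0 is hardwired to zero, never forwarded."""
    if exmem_we and exmem_rd != 0 and exmem_rd == src:
        return "EX/MEM"
    if memwb_we and memwb_rd != 0 and memwb_rd == src:
        return "MEM/WB"
    return "REG"

# DADD R1,R2,R3 ; DSUB R4,R1,R5 -> R1 is bypassed from EX/MEM.
assert forward_select(1, exmem_rd=1, exmem_we=True,
                      memwb_rd=0, memwb_we=False) == "EX/MEM"
```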

Forwarding

(Figure: the datapath with forwarding paths — multiplexers at the ALU inputs select among the register file, the EX/MEM result, and the MEM/WB result.)

Forwarding

Consider the following sequence:

  ADD R1, R2, R3
  LW  R4, 0(R1)
  SW  12(R1), R4

To prevent stalls, we need to forward the values of R1 and R4 from the pipeline registers to the inputs of the ALU and the data memory. In general, we may require a forwarding path from any pipeline register to the input of any functional unit.

Data Hazards Classification

A hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap caused by pipelining would change the order of access to an operand. Hazards arise when the order of writes and reads to a location is changed; memory references in this pipeline are always kept in order, so we are concerned with registers. Data hazards are classified as RAW, WAW, and WAR.

Read After Write (RAW) Hazard

Instruction i comes before instruction j. Also called a data dependence by compiler people. The hazard occurs when j tries to read a source before it is written by i. It is the most common data hazard, and the one dealt with above.

Write After Write (WAW) Hazard

Example (a general hazard, e.g. in multiprocessors):

  A = B + C
  A = 2.0

If the second instruction completes before the first one, we end up with the wrong value in A.

Write After Read (WAR) Hazard

Instruction J writes an operand before instruction I reads it:

  I: SUB R4, R1, R3
  J: ADD R1, R2, R3
  K: MUL R6, R1, R7

Called an anti-dependence by compiler writers; it results from reuse of the name R1. It can't happen in the MIPS 5-stage pipeline because all instructions take 5 stages, reads are always in stage 2, and writes are always in stage 5.

Write After Read (WAR) Hazards

Example:

  SW  R1, 0(R2)
  ADD R2, R3, R4

With a variable-depth pipeline, the second instruction might complete (and write R2) before the first one reads R2, so the store would read the wrong (new) value.

Data Hazards Requiring Stalls

Time (clock cycles)
               CC1     CC2     CC3     CC4     CC5   ...
lw  r1,0(r2)   Ifetch  Reg     ALU     DMem    Reg
sub r4,r1,r6           Ifetch  Reg     ALU     DMem  Reg
and r6,r1,r7                   Ifetch  Reg     ALU   DMem  Reg
or  r8,r1,r9                           Ifetch  Reg   ALU   DMem  Reg

Data Hazards Requiring Stalls

Consider the following code:

  LW  R1, 0(R2)
  SUB R4, R1, R5
  AND R6, R1, R7
  OR  R8, R1, R9

The LW instruction does not have the data until the end of CC4, but SUB needs it at the beginning of CC4. There is no way out but to stall the pipe for a cycle.

In such a case we need hardware, called a pipeline interlock, to detect the hazard and stall the pipeline. The interlock detects the hazard and introduces a bubble, stalling the pipeline; this may also change the forwarding.


Compiler Scheduling for Data Hazards

The compiler may try to rearrange the code to avoid stalls — for example, by not generating code in which a load is immediately followed by a use of the loaded value. Example: consider the code segment

  a = b + c
  d = e - f

Here are two ways of generating code.

Unscheduled (stalls before ADD and SUB):
  LW  Rb, b
  LW  Rc, c
  ADD Ra, Rb, Rc   (stall: Rc just loaded)
  SW  a, Ra
  LW  Re, e
  LW  Rf, f
  SUB Rd, Re, Rf   (stall: Rf just loaded)
  SW  d, Rd

Scheduled (no stalls):
  LW  Rb, b
  LW  Rc, c
  LW  Re, e
  ADD Ra, Rb, Rc
  LW  Rf, f
  SW  a, Ra
  SUB Rd, Re, Rf
  SW  d, Rd
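A toy stall counter shows why the second ordering wins. It is an illustration only, under the simplifying assumption that, with forwarding, a load stalls just its immediate successor:

```python
def load_use_stalls(code):
    """code: list of (op, dest, sources) tuples in program order.
    Count one stall whenever an instruction reads a register loaded
    by the instruction immediately before it."""
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        op, dest, _ = prev
        if op == "LW" and dest in cur[2]:
            stalls += 1
    return stalls

unscheduled = [
    ("LW", "Rb", []), ("LW", "Rc", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("SW", None, ["Ra"]),
    ("LW", "Re", []), ("LW", "Rf", []),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]
scheduled = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []),
    ("SW", None, ["Ra"]), ("SUB", "Rd", ["Re", "Rf"]),
    ("SW", None, ["Rd"]),
]
assert load_use_stalls(unscheduled) == 2   # ADD after LW Rc, SUB after LW Rf
assert load_use_stalls(scheduled) == 0
```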

Control for the DLX Pipeline

The process of letting an instruction move from the ID stage to the EX stage is called issuing the instruction. For the DLX integer pipeline, all data hazards are detected before the instruction is issued. That reduces the complexity of the control: since interlocks detect the hazard before any change to the machine state is made, we never have to back out and restore state.

Forwarding (Implementation)

The following hardware is used to implement forwarding. Forwarding can be done from the ALU output or from memory (Figs. 3.17 & 3.19).


Implementing the Load Interlock

The following comparisons are used; on a match, the control portion of ID/EX is changed to zero (a bubble) and the IF/ID register is recycled:

Opcode field of ID/EX   Opcode field of IF/ID          Comparison
Load                    Reg-Reg ALU                    ID/EX.IR 11..15 = IF/ID.IR 6..10
Load                    Reg-Reg ALU                    ID/EX.IR 11..15 = IF/ID.IR 11..15
Load                    Load, Store, ALU immediate     ID/EX.IR 11..15 = IF/ID.IR 6..10

Control Hazards

Control hazards arise from branching, where we don't know the next instruction. If a branch is taken, we need the target address; otherwise it is untaken (falls through). The simplest method is to stall the pipeline until we know the result of the branch, which leads to 3 stall cycles.
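The table's comparisons can be sketched in code (the opcode classes and parameter names are illustrative, not hardware signal names):

```python
def load_interlock(idex_op, idex_rd, ifid_op, ifid_rs1, ifid_rs2):
    """True if the instruction in IF/ID must stall one cycle behind the
    load in ID/EX. idex_rd is the load destination (IR bits 11..15);
    ifid_rs1 and ifid_rs2 are the source fields (IR bits 6..10 and 11..15)."""
    if idex_op != "LOAD":
        return False
    if ifid_op == "ALU_REG":
        # A register-register ALU op reads both source fields in EX.
        return idex_rd in (ifid_rs1, ifid_rs2)
    if ifid_op in ("LOAD", "STORE", "ALU_IMM"):
        # These need only the rs1 field in EX (a store's data register
        # is not required until MEM, so it can be forwarded there).
        return idex_rd == ifid_rs1
    return False

# LW R1,0(R2) ; SUB R4,R1,R5 -> one-cycle stall required.
assert load_interlock("LOAD", 1, "ALU_REG", 1, 5)
```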

Control Hazards

Branch instruction      IF  ID  EX  MEM WB
Branch successor            IF  stall stall IF  ID  EX  MEM WB
Branch successor + 1                        IF  ID  EX  MEM WB
Branch successor + 2                            IF  ID  EX  MEM
Branch successor + 3                                IF  ID  EX
Branch successor + 4                                    IF  ID
Branch successor + 5                                        IF

To reduce the branch penalty, we must do two things:
1. Find out whether the branch is taken as early as possible.
2. Compute the taken-branch target PC as early as possible (ASAP).

Control Hazards

In MIPS, the branches BEQZ and BNEZ require only a comparison with 0, which can be done easily by adding a comparator and performing the test in the ID stage. We can also calculate the branch target address during the ID stage by adding another adder. The new datapath is shown in Fig. 3.22; it requires only a single stall cycle. In some machines (CISC, deeply pipelined) a branch may cost many more cycles — another advantage of RISC.

Branch Behavior in Programs

From Figures 3.24 and 3.25 we can say:
1. Forward branches dominate backward branches.
2. Branch frequency varies from 4% to 25% of instructions.
3. The probability of a backward branch being taken is higher than that of a forward branch (backward branches are mainly loops).

How can we use these facts to improve the performance of the machine?

Compile-Time Solutions

The compiler may predict at compile time whether the branch is taken:

Freeze or flush the pipe — the easiest way (simple).
Predict not taken — proceed as usual, but be careful either not to change the state of the machine (or to back out if you do) until you know whether the prediction was correct.
Predict taken — useful only if we know the target address before we know the result of the comparison (the condition).

Scheduling the Branch Delay Slot

Here the compiler's job is to make the successor instruction valid and useful (Fig. 3.28). In (a), the scheduled instruction must be executed anyway, so there is no harm at all. In (b) and (c), the use of R1 in the branch condition prevents moving that instruction to after the branch; in both cases it must be OK to execute the SUB instruction when the branch goes the opposite way. (b) is useful when the branch is taken with high probability, and (c) the reverse.

Canceling Branches

A canceling (or nullifying) branch works as follows: the instruction includes the direction in which the branch was predicted. When the branch behaves as predicted, the instruction in the branch delay slot is executed as it would be with an ordinary delayed branch; otherwise the instruction is canceled (changed into a no-op).

The behavior of a predict-taken canceling branch depends on whether the branch is taken: the instruction in the delay slot is executed only if the branch is taken, and is otherwise turned into a no-op.

Performance of Branch Schemes

Consider an R4000-style pipeline with the following branch penalties. What is the addition to the CPI if unconditional branches are 4%, conditional untaken 10%, and conditional taken 6% of instructions?

Branch scheme     Penalty unconditional   Penalty untaken   Penalty taken
Flush pipeline    2                       3                 3
Predict taken     2                       3                 2
Predict untaken   2                       0                 3

Performance

Scheme            Uncond (4%)      Untaken (10%)   Taken (6%)      Total
Flush (stall)     0.04×2 = 0.08    0.1×3 = 0.3     0.06×3 = 0.18   0.56
Predict taken     0.04×2 = 0.08    0.1×3 = 0.3     0.06×2 = 0.12   0.50
Predict untaken   0.04×2 = 0.08    0               0.06×3 = 0.18   0.26

Static Branch Prediction

If we can predict branches, we can improve the performance of the code:

  LW   R1, 0(R2)
  SUB  R1, R1, R3
  BEQZ R1, L
  OR   R4, R5, R6
  ...
L: ADD  R7, R8, R9

If the branch is almost always taken and R7 is not needed on the fall-through path, hoist the ADD:

  LW   R1, 0(R2)
  ADD  R7, R8, R9
  SUB  R1, R1, R3
  BEQZ R1, L
  OR   R4, R5, R6

If the branch is rarely taken and R4 is not needed on the taken path, hoist the OR instead:

  LW   R1, 0(R2)
  OR   R4, R5, R6
  SUB  R1, R1, R3
  BEQZ R1, L
  ...
L: ADD  R7, R8, R9
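The CPI additions in the branch-penalty table above follow directly from frequency × penalty; as a check (an illustration, with the frequencies and penalties as given on the slides):

```python
# Branch frequencies and per-scheme penalties from the slides.
FREQ = {"uncond": 0.04, "untaken": 0.10, "taken": 0.06}
PENALTY = {
    "flush":           {"uncond": 2, "untaken": 3, "taken": 3},
    "predict_taken":   {"uncond": 2, "untaken": 3, "taken": 2},
    "predict_untaken": {"uncond": 2, "untaken": 0, "taken": 3},
}

def cpi_addition(scheme):
    # Extra CPI = sum over branch kinds of (frequency x penalty cycles).
    return sum(FREQ[k] * PENALTY[scheme][k] for k in FREQ)

# flush: 0.08 + 0.30 + 0.18 = 0.56 extra CPI, and so on down the table.
```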

Static Branch Prediction

From Figure ??, most branches are taken, so predict taken (misprediction rates range from 9% to 59%). Or: predict backward branches taken and forward branches not taken (beneficial for some programs). Or: use profile information — run the program, record which branches are taken and which are not, then use that information for regular runs.

Pipelines and Exceptions

The problem is what to do when an instruction in the pipe causes an exception, keeping in mind that there is more than one instruction in different stages of execution in the pipeline. Many situations cause exceptions (interrupts, faults): I/O interrupts, operating system calls, overflow or underflow, page faults, misaligned memory accesses, memory protection violations, undefined instructions, ...

Exceptions (Characterization)

Synchronous vs. asynchronous: the exception is synchronous if it occurs at the same place every time the program runs with the same inputs (a hardware failure is not).
User-requested vs. coerced: whether the user requests it (e.g., for debugging).
User-maskable vs. nonmaskable: whether the exception can be masked (not responded to).
Within vs. between instructions: within an instruction is usually harder to handle.
Resume vs. terminate: can the program continue, or must it terminate?

Stopping and Restarting

The correct way to deal with exceptions is to stop the pipeline, deal with the exception, and restart execution. These steps are taken:
1. Force a trap instruction into the pipeline on the next IF.
2. Until the trap is taken (completed), turn off all writes for the faulting instruction and the ones following it; place no-ops in all the latches.
3. After the exception-handling routine receives control, it first saves the PC of the faulting instruction. With branch-delay scheduling, we must save the addresses of all the instructions in the pipeline.

Stopping and Restarting

If the pipeline can be stopped so that all instructions before the faulting one complete, and all instructions from the faulting one onward can restart from scratch, the pipeline is said to have precise exceptions. If the state of the machine changes before the exception is detected (e.g., a floating-point op), the hardware must be able to retrieve the old values. Some machines have two modes: a precise-interrupt mode for debugging and an imprecise mode for fast execution; in precise mode we may have to limit the amount of overlap.

Exceptions in DLX

Every stage (except WB) can cause an exception: page fault, undefined opcode, arithmetic exception, and so on. Consider the following case:

  LW    IF  ID  EX  MEM WB
  ADD       IF  ID  EX  MEM WB

LW may cause an exception in MEM, while ADD can cause an exception in IF, which is detected earlier in time. We must not deal with ADD's exception before LW's.

Exceptions in DLX

The pipeline does not handle exceptions as they occur (that would be out-of-order handling). Instead, the hardware posts any exception caused by an instruction in a status vector associated with that instruction and turns off any control signal that might cause a write. The exception status vector is carried along with the instruction. Once the instruction is about to leave MEM, its status vector is examined, and exceptions are handled in program order.
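The status-vector idea can be sketched in a few lines (the class and function names are illustrative, not the DLX hardware):

```python
class InstrState:
    """One in-flight instruction with its exception status vector."""
    def __init__(self, name):
        self.name = name
        self.exceptions = []      # status vector carried down the pipe

    def post(self, stage, kind):
        # Record the exception; from here on, writes would be suppressed.
        self.exceptions.append((stage, kind))

def first_exception(instrs_in_order):
    """Examine status vectors in program order at commit: the oldest
    instruction's exception is handled first, even if a younger
    instruction faulted earlier in time."""
    for instr in instrs_in_order:
        if instr.exceptions:
            return instr
    return None

lw, add = InstrState("LW"), InstrState("ADD")
add.post("IF", "page fault")      # detected first in time...
lw.post("MEM", "page fault")      # ...but LW is older in program order
assert first_exception([lw, add]) is lw
```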

Exception in DLX The pipeline will not handle exception as they occur (out of order handling) The hardware posts the exceptions caused by any instruction in a status vector associated with that instruction, and turn off any control signal that might lead to writing. The exception status vector is carried along with the instruction Once the instruction is about to finish MEM, that status vector is examined, and exception is handled in order. 69 Instruction Set Complication Once the instruction is guaranteed to be completed, it is called committed In DLX, instruction is committed after the MEM stage, no change in the state of the machine before that. That is easy to handle, but on some machines, that might not be the case For example in autoincrement addressing, we must be able to back out of any changes. VAX and x86 string copying instruction uses the GPR as working registers, may restart from the middle of the instruction instead from the beginning 70