Lecture 5: Instruction Pipelining
- Basic concepts
- Pipeline hazards
- Branch handling and prediction
Zebo Peng, IDA, LiTH

Basic Concepts
Sequential execution of an N-stage task:
- Production time: N time units per product.
- Resource needed: one general-purpose machine.
- Productivity: one product per N time units.
Pipelined execution of an N-stage task:
- Production time: N time units for the first product.
- Resource needed: N special-purpose machines, one per stage.
- Productivity: one product per time unit, once the pipeline is full.
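The throughput figures above can be checked with a small sketch (illustrative Python, assuming one time unit per stage):

```python
# Sequential vs. pipelined production of an N-stage task.
# Each stage is assumed to take one time unit.

def sequential_time(n_stages: int, n_products: int) -> int:
    """One general-purpose machine: each product takes N time units."""
    return n_stages * n_products

def pipelined_time(n_stages: int, n_products: int) -> int:
    """N special-purpose machines: N units to fill the pipeline,
    then one finished product per time unit."""
    return n_stages + (n_products - 1)

N = 6
print(sequential_time(N, 100))   # 600 time units
print(pipelined_time(N, 100))    # 105 time units
# The ratio approaches N as the number of products grows.
print(sequential_time(N, 100) / pipelined_time(N, 100))
```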
Instruction Execution Stages
A typical instruction execution sequence:
1. Fetch Instruction (FI): fetch the instruction from memory.
2. Decode Instruction (DI): determine the op-code and the operand specifiers.
3. Calculate Operands (CO): calculate the effective addresses (e.g., base addr. + offset -> effective addr.).
4. Fetch Operands (FO): fetch the operands from memory.
5. Execute Instruction (EI): perform the operation.
6. Write Operand (WO): store the result in memory.

Instruction Pipelining
[Timing diagram: instructions I1-I9 flow through the six stages, one new instruction entering per cycle.]
This is the ideal case: speed-up by 6 times.
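The ideal case can be sketched as follows: in an N-stage pipeline, instruction i (counting from zero) occupies stage s during cycle i + s + 1, so k instructions finish in N + k - 1 cycles. A minimal Python sketch, using the six-stage sequence above:

```python
# Ideal six-stage pipeline timing. Stage names as in the list above.
stages = ["FI", "DI", "CO", "FO", "EI", "WO"]

def finish_cycle(i: int) -> int:
    """Cycle in which instruction i (0-based) leaves the last stage."""
    return i + len(stages)

def total_cycles(k: int) -> int:
    """Cycles to complete k instructions: N to fill, then 1 per instruction."""
    return len(stages) + (k - 1)

print(total_cycles(9))       # 14 cycles for I1-I9, vs 9*6 = 54 sequentially
print(54 / total_cycles(9))  # speed-up ~3.9 here, approaching 6 for long runs
```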
Actual Instruction Pipelining
[Timing diagram: instructions I1-I9 with gaps (bubbles) in the pipeline.]
In practice, there are many holes (bubbles), which reduce the speed-up factor. Different instructions have different execution patterns!

Number of Pipeline Stages
In general, a larger number of stages gives better performance. However:
- A larger number of stages increases the overhead in moving information between stages and in synchronizing the stages.
- The complexity of the CPU grows with the number of stages.
- It is difficult to keep a long pipeline at maximum rate due to hazards.
- A stage may not be divisible into sub-stages.
Examples:
- Intel 80486 and Pentium: five-stage pipeline for integer instructions, eight-stage pipeline for floating-point (FP) instructions.
- The Pentium 4 has a 20-stage pipeline (!).
- IBM PowerPC machines have different numbers of stages (3-9); e.g., the PowerPC 440 has 7 stages.
Lecture 5: Instruction Pipelining
- Basic concepts
- Pipeline hazards
- Branch handling and prediction

Pipeline Hazards (Conflicts)
Hazards are situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle. The instruction is said to be stalled. When an instruction is stalled:
- All instructions later in the pipeline than the stalled one are also stalled.
- No new instructions are fetched during the stall.
- Instructions earlier in the pipeline than the stalled one continue as usual.
Types of hazards:
- Structural hazards
- Data hazards
- Control hazards
Structural (Resource) Hazards
Hardware conflicts caused by the use of the same hardware resource at the same time (e.g., memory conflicts).
[Timing diagram: two instructions contend for the same resource, forcing a stall.]
Penalty: 1 cycle. (Note: the performance lost is multiplied by the number of stages.)

Structural Hazard Solutions
- In general, the hardware resources in conflict are duplicated in order to avoid structural hazards, e.g., two ALUs.
- Functional units (ALU, FP unit) can also be pipelined themselves to support several instructions at the same time.
- Memory conflicts can be solved by:
  - having two separate caches, one for instructions and one for data (Harvard architecture);
  - using multiple banks of the main memory; or
  - keeping as many intermediate results as possible in the registers (!).
Data Hazards
Caused by reversing the order of data-dependent operations due to the pipeline (e.g., WRITE/READ conflicts):
  ADD A, R1 ; Mem(A) <- Mem(A) + R1
  SUB A, R2 ; Mem(A) <- Mem(A) - R2
Example: A = 100, Mem(100) = 100, R1 = 30, R2 = 50.
[Timing diagram: SUB needs the value of Mem(A) before ADD has made it available.]

Data Hazard Penalty
  ADD A, R1 ; Mem(A) <- Mem(A) + R1
  SUB A, R2 ; Mem(A) <- Mem(A) - R2
[Timing diagram: SUB is stalled for two cycles until the value of Mem(A) is available.]
Penalty: 2 cycles.
Data hazards are an important issue: they happen very often, since programs have many data dependencies.
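The two-cycle penalty can be derived from the stage numbers. A minimal sketch, assuming the six-stage pipeline above, with the writer producing its value in WO (stage 6) and the reader needing it in FO (stage 4):

```python
# Sketch: stall penalty for a memory RAW hazard in a six-stage pipeline
# (FI, DI, CO, FO, EI, WO). The writer produces its result in WO; the
# reader needs it in FO. Stage numbers are 1-based.
WRITE_STAGE = 6   # WO of the first instruction
READ_STAGE = 4    # FO of the second instruction

def stall_cycles(write_stage: int, read_stage: int, distance: int = 1) -> int:
    """distance = how many instructions apart the two are in program order.
    Without stalls the reader reaches its read stage at cycle
    distance + read_stage; the value is ready only after cycle write_stage."""
    return max(0, (write_stage + 1) - (distance + read_stage))

print(stall_cycles(WRITE_STAGE, READ_STAGE))              # 2, as on the slide
print(stall_cycles(WRITE_STAGE, READ_STAGE, distance=3))  # 0: far enough apart
```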
Data Hazard Solutions
The penalty due to data hazards can be reduced by a technique called forwarding (bypassing).
[Diagram: the ALU result is fed back through a multiplexer (bypass path) to the ALU input; the memory system comprises registers, cache and main memory.]
If the hardware detects that the value needed for an operation is the one produced by the previous instruction but not yet written back, the ALU selects the forwarded result instead of the value from the memory system.

Control Hazards
Caused by branch instructions, which change the instruction execution order.
[Timing diagram: a conditional branch (BRA ... IF Zero) enters the pipeline; the instructions after it are fetched and executed before the branch outcome is known. The branch should be taken!]
This may lead to unintended changes of memory contents (!)
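The bypass decision in front of the ALU can be sketched as a simple selection; all names here are illustrative, not from any real ISA:

```python
# Minimal sketch of the forwarding (bypass) multiplexer in front of the ALU.

def alu_operand(needed_loc, prev_dest, prev_result, memory):
    """Select the forwarded ALU result when the previous instruction
    wrote the location we need but has not yet written it back."""
    if needed_loc == prev_dest:
        return prev_result      # bypass path: take the value off the ALU output
    return memory[needed_loc]   # normal path: read from the memory system

mem = {"A": 100}
# Previous instruction computed Mem(A) = 130 but has not reached write-back:
print(alu_operand("A", "A", 130, mem))  # 130 (forwarded), not the stale 100
```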
Lecture 5: Instruction Pipelining
- Basic concepts
- Pipeline hazards
- Branch handling and prediction

Branch Handling (1)
Stall the pipeline until the branch outcome is known.
[Timing diagram: after the conditional branch (BRA ... IF Zero) is fetched, the pipeline stalls for four cycles before the branch target is fetched. The branch should be taken!]
This leads to a very large loss of performance, in particular since 20%-35% of the executed instructions are branches.
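The cost of always stalling can be estimated with a back-of-the-envelope formula: with branch fraction b and s stall cycles per branch, the cycles per instruction grow from 1 to 1 + b*s. The four stall cycles below are an assumption matching the diagram above:

```python
# Rough CPI estimate when every branch stalls the pipeline.
# Assumes an ideal CPI of 1 for non-branch instructions.

def effective_cpi(branch_frac: float, stall_cycles: int) -> float:
    return 1 + branch_frac * stall_cycles

print(effective_cpi(0.20, 4))  # 1.8 CPI at 20% branches
print(effective_cpi(0.35, 4))  # 2.4 CPI at 35% branches
```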
Branch Handling (2)
Multiple streams: implement hardware resources to deal with the different branch alternatives.
[Timing diagram: both the fall-through path and the branch-target path are fetched until the branch condition is known.]
This is an expensive solution, and special memory techniques and control are needed to fully utilize it!

Branch Handling (3)
- Pre-fetch branch target: when a conditional branch is recognized, the following instruction is fetched, and the branch target is also pre-fetched.
- Loop buffer: use a small, high-speed memory to keep the n most recently fetched instructions. If a branch is to be taken, the buffer is first checked to see if the branch target is in it.
- Special cache for branch target instructions.
- Delayed branch: re-arrange the instructions so that branching occurs later than originally specified. A software solution.
Delayed Branch Example
Original instruction sequence (no data dependence):
  ADD X;
  BRA L;
Delayed branch:
  BRA L;
  ADD X;
[Timing diagram: the branch condition is known by the time ADD completes, so the branch takes effect without a stall.]
The compiler or the programmer has to find an instruction (to be executed regardless of the branch outcome) which can be moved from its original place to the branch delay slot. This succeeds in 60% to 85% of the cases. It leads, however, to unreadable code.

Branch Prediction
When a branch is encountered, a prediction is made and the predicted path is followed:
- The instructions on the predicted path are fetched.
- The fetched instructions can also be executed, which is called speculative execution. Results produced by these executions should be marked as tentative.
- When the branch outcome is decided: if the prediction is correct, the special tags on the tentative results are removed. If not, the tentative results are removed, and execution continues on the other path.
Branch prediction can be based on static or dynamic information.
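The compiler's delay-slot filling step can be sketched as a search for a branch-independent instruction to move behind the branch. This is a toy model under simplifying assumptions (one destination per instruction, one delay slot), not a real compiler pass:

```python
# Toy delay-slot filler. An instruction is a (name, writes, reads) tuple;
# the last instruction in the list is the branch.

def fill_delay_slot(instrs):
    """Move the latest earlier instruction that the branch does not
    depend on into the slot after the branch; otherwise insert a NOP."""
    branch = instrs[-1]
    for i in range(len(instrs) - 2, -1, -1):
        name, writes, reads = instrs[i]
        if writes not in branch[2]:  # branch does not read what it writes
            return instrs[:i] + instrs[i + 1:-1] + [branch, instrs[i]]
    return instrs + [("NOP", None, ())]

prog = [("ADD X", "X", ()), ("BRA L", None, ("Z",))]  # no data dependence
print(fill_delay_slot(prog))  # BRA L first, ADD X in the delay slot
```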
Static Branch Prediction
Predict always taken:
- Assume that the jump will happen.
- Always fetch the target instruction.
[Timing diagram: the branch target is fetched immediately after the conditional branch (BRA ... IF Zero).]

Static Branch Prediction (Cont'd)
Predict never taken:
- Assume that the jump will not happen.
- Always fetch the next instruction.
Predict by operation codes:
- Some instructions are more likely to result in a jump than others.
- BNZ (Branch if the result is Not Zero): jump more likely.
- BEZ (Branch if the result Equals Zero): jump less likely.
- Can achieve up to 75% success.
Predict by relative positions:
- Backward-pointing branches will be taken (usually a loop back).
- Forward-pointing branches will not be taken (often a loop exit).
Dynamic Branch Prediction
Based on branch history: store information regarding branches in a branch-history table so as to predict the branch outcome more accurately, e.g., by assuming that the branch will do what it did last time.
[Diagram: the branch-history table is indexed by the instruction address and stores, for each branch instruction, its prediction state and the predicted branch target address.]
One-bit predictor: a single bit records the last outcome, Taken (1) or Not Taken (0), and the next prediction repeats it. A loop-closing branch causes two errors per execution of the loop: once at the loop exit and once at re-entry.

Bimodal Prediction
Use a 2-bit saturating counter to predict the most common direction, where the first bit gives the prediction:
- Branches evaluated as taken (T) increment the state towards Strongly Taken.
- Branches evaluated as not taken (N) decrement the counter towards Strongly Not Taken.
[State diagram: Strongly Not Taken (00) <-> Not Taken (01) <-> Taken (10) <-> Strongly Taken (11).]
It tolerates a branch going in an unusual direction one time: a loop-closing branch is mispredicted once rather than twice.
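The two schemes above can be compared on a loop-closing branch pattern (taken four times, then not taken, repeated). A minimal sketch of both predictors:

```python
# One-bit predictor vs. bimodal 2-bit saturating counter on a branch
# history given as a list of outcomes (True = taken).

def one_bit_errors(history):
    state, errors = True, 0          # start by predicting "taken"
    for taken in history:
        if state != taken:
            errors += 1
        state = taken                # remember only the last outcome
    return errors

def two_bit_errors(history):
    counter, errors = 3, 0           # 3 = Strongly Taken (11)
    for taken in history:
        if (counter >= 2) != taken:  # predict taken in states 10 and 11
            errors += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return errors

# Three runs of a loop whose closing branch is taken 4 times, then exits:
pattern = ([True] * 4 + [False]) * 3
print(one_bit_errors(pattern), two_bit_errors(pattern))  # 5 vs 3
```

The one-bit predictor misses both the loop exit and the re-entry after the first run; the 2-bit counter misses only the exit.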
Summary
- Instruction execution can be substantially accelerated by instruction pipelining.
- A pipeline is organized as a succession of N stages; ideally, N instructions can be active inside the pipeline simultaneously.
- Keeping a pipeline at its maximal rate is prevented by pipeline hazards:
  - Structural hazards are due to resource conflicts.
  - Data hazards are caused by data dependencies between instructions.
  - Control hazards are produced as a consequence of branch instructions.
- Branches can dramatically affect pipeline performance, so it is very important to reduce the penalties they produce. (Dynamic) prediction is an efficient way to address this problem.