Ti5317000 Parallel Computing PIPELINING Michał Roziecki, Tomáš Cipr 2005-2006
Introduction to pipelining - What is pipelining? Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Today, pipelining is key to making processors fast.
Introduction to pipelining - How it works
The basic action of any microprocessor as it moves through the instruction stream can be broken down into a series of four simple steps, which each instruction in the code stream goes through in order to be executed:
1. Fetch the next instruction from the address stored in the program counter.
2. Store that instruction in the instruction register, decode it, and increment the address in the program counter.
3. Execute the instruction currently in the instruction register. If the instruction is not a branch instruction but an arithmetic instruction, send it to the proper ALU.
4. Write the results of that instruction from the ALU back into the destination register.
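The four steps above can be sketched as a simple interpreter loop. This is a minimal illustration, not a real ISA; the ("op", dest, src) instruction format and register names are invented for the example:

```python
# Minimal sketch of the fetch-decode-execute-write cycle.
# The ("op", dest, src) instruction format is invented for illustration.

def run(program):
    registers = {"r0": 0, "r1": 0, "r2": 0}
    pc = 0  # program counter
    while pc < len(program):
        instruction = program[pc]    # 1. fetch from the address in the PC
        op, dest, src = instruction  # 2. decode into its fields...
        pc += 1                      #    ...and increment the PC
        if op == "add":              # 3. execute: an arithmetic instruction
            result = registers[dest] + registers[src]  # goes to the ALU
        elif op == "mov":
            result = src             # immediate move, for the example
        registers[dest] = result     # 4. write the result back
    return registers

regs = run([("mov", "r1", 5), ("mov", "r2", 7), ("add", "r1", "r2")])
```

In a non-pipelined design these four steps finish completely for one instruction before the next one is fetched; pipelining overlaps them instead.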
Introduction to pipelining - How it works The total execution time for each individual instruction is not changed by pipelining: it still takes an instruction 4 ns to make it all the way through the processor. Pipelining doesn't speed up instruction execution time, but it does speed up program execution time by increasing the number of instructions finished per unit time. Fig. A four-stage pipeline
Introduction to pipelining - RISC vs. CISC
RISC: Small number of instructions; Simple instructions; Low cycles per instruction, large code sizes; Single-clock, reduced instructions only
CISC: Large number of instructions; Complex instructions; Small code sizes, high cycles per instruction; Includes multi-clock complex instructions
Introduction to pipelining - Pipelining vs. single-cycle processors Single-cycle processor / Pipelined processor. For the single-cycle processor it takes 16 nanoseconds to execute four instructions, while for the pipelined processor it takes only 7 nanoseconds.
Introduction to pipelining - Counting example Suppose we execute 100 instructions. Single-cycle machine: 45 ns/cycle x 1 CPI x 100 inst = 4500 ns. Ideal pipelined machine: 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
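The arithmetic of the counting example can be checked directly; the 4 extra cycles come from the slide's ideal 4-stage pipeline:

```python
instructions = 100

# Single-cycle machine: one 45 ns cycle per instruction (CPI = 1)
single_cycle_ns = 45 * 1 * instructions        # 4500 ns

# Ideal pipelined machine: 10 ns cycles, plus 4 extra cycles for the drain
pipelined_ns = 10 * (1 * instructions + 4)     # 1040 ns

speedup = single_cycle_ns / pipelined_ns       # roughly 4.3x
```

Note the speedup comes both from the shorter cycle (45 ns down to 10 ns) and from overlapping instructions, minus a small fixed cost for filling/draining the pipeline.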
Introduction to pipelining - Characteristics of pipelined processor design Main memory must operate in one cycle. Instruction and data memory must appear separate. Few buses are used. Data is latched (stored in temporary registers) at each pipeline stage, in so-called pipeline registers. ALU operations take only 1 clock cycle.
Introduction to pipelining - Pipelining history IBM 360/91 - First implemented pipelining - Performance increased 2.5 to 25 times. P6 (Pentium Pro) - 3-way superscalar processor - Included 3 pipelines. Future?
Pipelining issues - Dependence among instructions Execution of some instructions can depend on the completion of other instructions in the pipeline. One solution is to stall the pipeline. Dependences involving registers can be detected, and data forwarded to the instruction needing it without waiting for the register write. Dependence involving memory is harder and is sometimes addressed by restricting how memory accesses may be reordered.
Pipelining issues - Adapting instructions All instructions must fit into a common pipeline stage structure. A 5-stage pipeline is typically used in RISC processors: Instruction fetch; Decode and operand access; ALU operation; Data memory access; Register write
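The overlap of these five stages can be shown with a small timing table. This is a sketch of an ideal pipeline with no hazards: instruction i enters fetch in cycle i, so in cycle c it occupies stage c - i:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]  # the five RISC stages listed above

def timeline(n_instructions):
    """Return, per clock cycle, which stage each instruction occupies."""
    n_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for cycle in range(n_cycles):
        row = []
        for instr in range(n_instructions):
            stage = cycle - instr  # instruction i enters IF in cycle i
            row.append(STAGES[stage] if 0 <= stage < len(STAGES) else "--")
        rows.append(row)
    return rows

t = timeline(3)
# In cycle 2, instruction 0 executes while 1 decodes and 2 is fetched.
```

Each row is one clock cycle; reading down a column shows one instruction marching through all five stages while its neighbors occupy the other stages.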
Pipelining issues - Hazards A hazard is the result of a dependency: it occurs when two or more of these simultaneous (possibly out-of-order) instructions conflict. There are typically three types of hazards: data hazards, structural hazards, and branching hazards.
Pipelining issues - Data hazards Data hazards occur when data is modified. There are three situations in which they can occur: Read after Write (RAW): a location is modified and read soon after. Write after Read (WAR): a location is read and written soon after. Write after Write (WAW): two instructions that write to the same location are performed.
Pipelining issues - Data hazards - Classification and possible solutions Bubbling the pipeline: as instructions are fetched, control logic determines whether a hazard could/will occur. If so, the control logic inserts NOPs (bubbles) into the pipeline. Forwarding: forwarding is implemented by feeding back the output of an instruction into the previous stage(s) of the pipeline as soon as that output is available.
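The hazard check behind both techniques can be sketched as follows. This is a toy model of the control logic for a RAW hazard between two adjacent instructions; the (dest, src1, src2) tuples are invented for illustration:

```python
# Sketch of RAW hazard handling between two adjacent instructions.
# Instructions are (dest, src1, src2) register-name tuples, invented here.

def raw_hazard(producer, consumer):
    """True if `consumer` reads a register that `producer` writes."""
    dest, _, _ = producer
    _, src1, src2 = consumer
    return dest in (src1, src2)

def resolve(producer, consumer, can_forward):
    if not raw_hazard(producer, consumer):
        return "no action"
    # Forwarding feeds the ALU output straight back to the next instruction;
    # otherwise the control logic inserts NOPs (bubbles) until write-back.
    return "forward" if can_forward else "stall (insert NOPs)"

add = ("r1", "r2", "r3")  # r1 = r2 + r3
sub = ("r4", "r1", "r5")  # r4 = r1 - r5 -> reads r1 before it is written back
```

Forwarding avoids the lost cycles that bubbling would insert, at the cost of extra datapaths and comparators in the hardware.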
Pipelining issues - Structural hazards A structural hazard occurs when a part of the processor's hardware is needed by two or more instructions at the same time. A structural hazard might occur, for instance, if a program were to execute a branch instruction followed by a computation instruction. Because they are executed in parallel, and because branching is typically slow, it is quite possible (depending on architecture) that the computation instruction and the branch instruction will both require the ALU at the same time.
Pipelining issues - Branch hazards Branching hazards (also known as control hazards) occur when the processor is told to branch: if a certain condition is true, jump from one part of the instruction stream to another, not necessarily the next one sequentially. In such a case, the processor cannot tell in advance whether or not it should process the next instruction.
Pipelining issues - Branch prediction The microprocessor tries to predict whether the branch instruction will jump or not, based on a record of what this branch has done previously. If the prediction turns out to be wrong, it has to flush the pipeline and discard all calculations that were based on this prediction; but if the prediction was correct, it has saved a lot of time. Different kinds of branch prediction: trivial prediction, static prediction, local branch prediction, combined branch prediction.
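A common form of local branch prediction keeps a small saturating counter per branch address; a minimal sketch, with the state encoding and default chosen for illustration:

```python
# Minimal 2-bit saturating-counter branch predictor, one counter per branch
# address -- a common local-prediction scheme. Details here are illustrative.

class TwoBitPredictor:
    def __init__(self):
        self.counters = {}  # branch address -> state 0..3

    def predict(self, addr):
        # States 2 and 3 predict "taken"; states 0 and 1 predict "not taken".
        return self.counters.get(addr, 1) >= 2

    def update(self, addr, taken):
        # Saturate at 0 and 3 so one surprise outcome can't flip the prediction.
        state = self.counters.get(addr, 1)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
        self.counters[addr] = state

p = TwoBitPredictor()
for _ in range(3):
    p.update(0x40, taken=True)  # the branch at 0x40 keeps being taken
```

After a short history of taken outcomes the counter saturates at "strongly taken", so loop-closing branches, which are taken almost every iteration, are predicted well.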
THE SPEEDUP FROM PIPELINING - Speedup against single-cycle processor - What affects speedup? - Theoretical vs. real-world speedup - Two views of speedup; according to: - Number of pipeline stages - Instruction throughput
Speedup and pipeline depth - Pipeline depth = number of stages - One instruction completed each clock cycle - The more finely instruction processing is sliced, the faster the clock frequency can be - Speedup is ideally equal to pipeline depth - 4-stage pipeline ~ 4x speedup - 8-stage pipeline ~ 8x speedup
Speedup and pipeline depth (contd.) - Does speedup equal pipeline depth in reality? No! - Why not? - Equal duration of the stages has to be preserved - Perfect splitting into equal stages is impossible - The clock cycle is suited to the slowest stage - The more finely the stages are sliced, the greater the speedup
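The "slowest stage" effect can be quantified: with unequal stage delays the clock period is set by the longest stage, so the real speedup falls short of the stage count. The per-stage latencies below are invented for illustration:

```python
# Hypothetical per-stage latencies in ns; the split is deliberately unequal.
stage_ns = [1.0, 1.5, 3.0, 1.0, 1.5]  # 5 stages, 8 ns of total work

unpipelined_cycle = sum(stage_ns)  # 8.0 ns to do all the work sequentially
pipelined_cycle = max(stage_ns)    # clock suited to the slowest stage: 3.0 ns

speedup = unpipelined_cycle / pipelined_cycle  # ~2.67x, not the ideal 5x
```

Only if all five stages took exactly 1.6 ns would the speedup reach the ideal factor of 5; any imbalance wastes time in the faster stages every cycle.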
Speedup: Theory vs. Real-World Fig. Relative speedup as a function of pipeline depth (2-20 stages), theoretical vs. real-world.
Speedup and instruction throughput - Speedup is also affected by the pipelining process itself - Instruction throughput = instructions per clock (IPC) - IPC = 1 in a single-cycle processor - IPC = 1 in an ideal pipelined processor - Issues in the real world: pipeline filling, pipeline stalls
Throughput and pipeline filling - The pipeline needs several clock cycles to fill up - No instructions complete while it fills - An average IPC of 1 is thus a limit, never reached in practice - The more cycles the pipeline runs, the greater the average IPC, and so the speedup - Example, 4-stage pipeline: - After 5 cycles: IPC = 1 instruction / 5 = 0.2 - After 100 cycles: IPC = 96 / 100 = 0.96
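Following the slide's convention, where the 4-stage pipeline's first instruction completes in cycle 5, the average IPC after n cycles is (n - 4) / n, which approaches 1 without ever reaching it:

```python
def average_ipc(cycles, fill=4):
    """Average IPC, counting the slide's 4-cycle fill before completions start."""
    completed = max(cycles - fill, 0)
    return completed / cycles

# Matches the numbers on the slide:
ipc_5 = average_ipc(5)      # 1 / 5 = 0.2
ipc_100 = average_ipc(100)  # 96 / 100 = 0.96
```

The fill cost is a fixed overhead, so its relative impact shrinks as the program runs longer.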
Average instruction throughput Fig. Average instruction throughput (IPC, 0 to 1) over 100 clock cycles: ideal pipeline vs. ideal 4-stage pipeline.
Throughput and pipeline stalls - Can the pipeline be kept full after filling? No! - The pipeline so far was still idealized - A real pipeline has to deal with hazards - Pipeline stalls - Pipeline flushes - Speedup decreases further
Instruction throughput with two-cycle stall Fig. Average instruction throughput (IPC, 0 to 1) over 100 clock cycles for a 4-stage pipeline with a two-cycle stall.
ENHANCED PIPELINING - Is it possible to break the limit of one instruction per clock? Yes, but - Low-level parallelism needed - Instruction-level parallelism - Superscalar - Very Long Instruction Word (VLIW) - Data-level parallelism
Instruction-level parallelism - Speedup gained by adding more hardware - Makes use of inherent parallelism - Superscalar - Dynamically distributes instructions to functional units - VLIW - Long instruction words with many operations statically compiled into a single word
Data-level parallelism - Based on SIMD concept - Single instruction is executed over short vectors - Benefit gained in specific applications - Multimedia - Complexity of processor is not much increased
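The SIMD idea, one instruction applied across all lanes of a short vector, can be mimicked in plain Python. This is a toy model only; a real SIMD unit performs the lane-wise operation in a single hardware instruction:

```python
def simd_add(a, b):
    """One 'instruction' applied element-wise across all lanes of two vectors."""
    assert len(a) == len(b)  # SIMD operates on fixed-width vectors
    return [x + y for x, y in zip(a, b)]

# Four pixels brightened at once -- the kind of multimedia workload SIMD targets:
pixels = simd_add([10, 20, 30, 40], [5, 5, 5, 5])
```

Because the control logic is shared across all lanes, the extra hardware cost is modest, which is why the slide notes that processor complexity is not much increased.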
PIPELINING ON PENTIUM 4 - Hyperpipelining technology - 20-stage pipeline - Clock frequency increased by 40% - Advanced branch prediction - 4K-entry branch target buffer - Successful prediction in 93-94% of cases
SUMMARY - Pipelining characteristics - Fetch, Decode, Execute, Write - Single-cycle vs. pipelined processor - Pipelining issues - Hazards - The speedup from pipelining - Pipeline depth - Enhanced pipelining - Low-level parallelism