Instruction Level Parallelism Appendix C and Chapter 3, HP5e
Outline
- Pipelining, hazards
- Branch prediction
- Static and dynamic scheduling
- Speculation
- Compiler techniques, VLIW
- Limits of ILP
Implementation of RISC ISA: Stages
- Instruction Fetch (IF)
- Instruction Decode/Register Fetch (ID), with fixed-field decoding
- Execution/Effective Address (EX)
- Memory Access (MEM)
- Write Back (WB)
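MIPS Datapath
[Figure: five-stage MIPS datapath. Stages: Instruction Fetch (IF), Instruction Decode/Register Fetch (ID), Execute/Address Calculation (EX), Memory Access (MEM), Write Back (WB). Labeled components: PC, NPC, adder (+4), instruction memory (IM), IR, register file (Regs; rs, rt, rd), A and B latches, sign extend (16 to 32 bits), Imm, multiplexers, Zero? test, Cond, ALU, ALU Output, data memory (DM), LMD.]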
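Multiple Issue Integer Pipeline
[Figure: multiple-issue integer pipeline across IF, ID, EX, MEM, WB. Labeled components: instruction memory (IM), IR0 and IR1, register file read (RF Read), A and B latches, Zero? test, data memory (DM), register file write (RF Write).]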
Pipeline Performance
An unpipelined processor has a 1 ns clock cycle. ALU operations and branches take 4 cycles, and memory operations take 5 cycles. The relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that, because of clock skew and setup, pipelining adds 0.2 ns of overhead to the clock. What is the speedup?
\[ \text{Average instruction execution time} = \text{Clock cycle} \times \text{Average CPI} \]
\[ \text{CPI} = \sum_{i=1}^{n} \frac{IC_i}{\text{Instruction count}} \times \text{CPI}_i \]
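The question above can be worked through numerically. The sketch below assumes the three frequencies map, in order, to ALU operations, branches, and memory operations, and that the pipelined processor reaches an ideal CPI of 1; the names `average_cpi` and `mix` are illustrative only.

```python
def average_cpi(mix):
    """Weighted CPI: sum over classes of (fraction of instructions) * (cycles)."""
    return sum(fraction * cycles for fraction, cycles in mix)


# (fraction, cycles) per instruction class; assumed order: ALU, branch, memory.
mix = [(0.40, 4), (0.20, 4), (0.40, 5)]

unpipelined_clock_ns = 1.0
pipelining_overhead_ns = 0.2

# Unpipelined: average instruction time = clock cycle * average CPI.
unpipelined_time_ns = unpipelined_clock_ns * average_cpi(mix)

# Pipelined: assumed ideal CPI of 1, with the clock stretched by the overhead.
pipelined_time_ns = (unpipelined_clock_ns + pipelining_overhead_ns) * 1.0

speedup = unpipelined_time_ns / pipelined_time_ns
print(f"average CPI = {average_cpi(mix):.2f}, speedup = {speedup:.2f}")
```

Under these assumptions the unpipelined average is 4.4 ns per instruction against 1.2 ns pipelined, a speedup of roughly 3.7.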
Dependences Pipeline Hazards Structural & Data
Outline
- Data dependences
- Name dependences
- Structural hazards
- Data hazards
- Stalling, forwarding
Basic Block
A straight-line code sequence with no branches in except to the entry and no branches out except at the exit.
Loop: L.D     F0, 0(R1)
      ADD.D   F4, F0, F2
      S.D     F4, 0(R1)
      DADDUI  R1, R1, #-8
      BNE     R1, R2, Loop
Dependence
for (i=0; i<=999; i=i+1)
    x[i] = x[i] + a;

Loop: L.D     F0, 0(R1)
      ADD.D   F4, F0, F2
      S.D     F4, 0(R1)
      DADDUI  R1, R1, #-8
      BNE     R1, R2, Loop

- Data dependence (RAW)
- Name dependences (WAR, WAW): register renaming
      ADD.D   F4, F0, F2
      ADD.D   F4, F6, F8
- Hazard: overlap during execution could change the order of access to the operand involved in the dependence.
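The three dependence types can be told apart by comparing the destination and source registers of two instructions. The following is a minimal sketch, assuming each instruction is described by a hypothetical `Instr` record with one optional destination and a tuple of sources; dependences through memory are ignored.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]      # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]    # registers read


def dependences(first: Instr, second: Instr) -> List[str]:
    """Register dependences of `second` on `first` (first is earlier in program order)."""
    found = []
    if first.dest is not None and first.dest in second.srcs:
        found.append("RAW")  # true data dependence: second reads what first writes
    if second.dest is not None and second.dest in first.srcs:
        found.append("WAR")  # antidependence: second overwrites what first reads
    if first.dest is not None and first.dest == second.dest:
        found.append("WAW")  # output dependence: both write the same register
    return found


ld = Instr("L.D", "F0", ("R1",))
add = Instr("ADD.D", "F4", ("F0", "F2"))
add2 = Instr("ADD.D", "F4", ("F6", "F8"))
sd = Instr("S.D", None, ("F4", "R1"))
daddui = Instr("DADDUI", "R1", ("R1",))

print(dependences(ld, add))      # L.D then ADD.D
print(dependences(add, add2))    # the two ADD.D instructions writing F4
print(dependences(sd, daddui))   # S.D then DADDUI
```

With the instructions from the slide, L.D followed by ADD.D is a RAW dependence on F0, the two ADD.D instructions form a WAW name dependence on F4, and S.D followed by DADDUI is a WAR name dependence on R1.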
Hazards
- Program order: ILP preserves program order only where it affects the outcome of the program
- Structural hazards: resource conflicts
- Data hazards: RAW, WAW, WAR
- Control hazards: whether or not an instruction should be executed depends on a control decision made by an earlier instruction
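Structural Hazard
Clock cycle   1    2    3    4    5    6    7    8    9
i1            MEM  ID   EX   MEM  WB
i2                 MEM  ID   EX   MEM  WB
i3                      MEM  ID   EX   MEM  WB
i4                           MEM  ID   EX   MEM  WB
i5                                MEM  ID   EX   MEM  WB
HAZARD: with a unified memory, instruction fetch also uses the memory (shown as MEM in the first stage), so the fetch of i4 conflicts with the data access of i1 in cycle 4.
Examples: unified memory; register file (WB and ID).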
Cost of a Load Structural Hazard
Data references constitute 40% of the instruction mix. Ideal CPI = 1 (with no structural hazards). Assume that the processor with the structural hazard has a clock rate that is 1.1 times higher than the clock rate of the processor without the hazard. Which processor is faster, and by how much?
\[ \text{Avg. instruction time} = \text{CPI} \times \text{Clock cycle time} \]
\[ \text{Avg. instruction time}_{\text{ideal}} = \text{CPI} \times \text{Clock cycle time}_{\text{ideal}} \]
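Cost of a Load Structural Hazard
\[ \text{Avg. instruction time} = \text{CPI} \times \text{Clock cycle time} \]
\[ \text{Avg. instruction time} = (1 + 0.4 \times 1) \times \frac{\text{Clock cycle time}_{\text{ideal}}}{1.1} \]
\[ \text{Avg. instruction time} \approx 1.27 \times \text{Clock cycle time}_{\text{ideal}} \]
With an ideal CPI of 1, the processor without the structural hazard takes one ideal clock cycle per instruction, so it is about 1.27 times faster.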
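The same arithmetic can be checked with a short script. It is a sketch that assumes every data reference costs exactly one stall cycle on the processor with the hazard; the variable names are illustrative.

```python
def avg_instruction_time(cpi, clock_cycle_time):
    """Average instruction time = CPI * clock cycle time."""
    return cpi * clock_cycle_time


ideal_cycle_time = 1.0        # arbitrary time unit
data_reference_fraction = 0.4
stall_cycles_per_load = 1     # assumption: one stall per data reference
clock_rate_ratio = 1.1        # hazard processor clock is 1.1x faster

time_without_hazard = avg_instruction_time(1.0, ideal_cycle_time)
time_with_hazard = avg_instruction_time(
    1.0 + data_reference_fraction * stall_cycles_per_load,
    ideal_cycle_time / clock_rate_ratio,
)

ratio = time_with_hazard / time_without_hazard
print(f"processor without the hazard is {ratio:.2f}x faster")
```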
Data Hazards
R1 is updated in the WB stage.
R1 <- R2 + R3
R4 <- R1 + R5
[Figure: five-stage MIPS datapath with the IR of each in-flight instruction shown at successive pipeline stages.]
Data Hazard
Time (clock cycles) ->
R1 <- R2 + R3    IF  ID  EX  MA  WB
R4 <- R1 + R5        IF  ID  EX  MA  WB
                         IF  ID  EX  MA  WB
                             IF  ID  EX  MA  WB
                                 IF  ID  EX  MA  WB
Wrong register values!
How to overcome this hazard?
Stalled Stages and Pipeline Bubbles
Time (clock cycles) ->
I1: R1 <- R2 + R3    IF  ID  EX  MA  WB
I2: R4 <- R1 + R5        IF  ID  ID  ID  ID  EX  MA  WB
I3                           IF  IF  IF  IF  ID  EX  MA  WB
I4                                           IF  ID  EX  MA  WB
I5                                               IF  ID  EX  MA  WB
Stalled stages: the repeated ID of I2 and the repeated IF of I3.

Resource usage:
Cycle  1    2    3    4    5    6    7    8    9    10   11   12
IF     I1   I2   I3   I3   I3   I3   I4   I5
ID          I1   I2   I2   I2   I2   I3   I4   I5
EX               I1   nop  nop  nop  I2   I3   I4   I5
MA                    I1   nop  nop  nop  I2   I3   I4   I5
WB                         I1   nop  nop  nop  I2   I3   I4   I5
Resolving Data Hazards
- Stalling one of the instructions
- Data forwarding (bypassing)
- Scheduling hazardous instructions away from each other
Stalling (Interlocking)
R1 <- R2 + R3
R4 <- R1 + ...
[Figure: five-stage MIPS datapath with stall-condition logic that inserts a NOP into the pipeline.]
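Pipeline Performance
\[ \text{Speedup}_{\text{pipelining}} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \]
\[ \text{Speedup}_{\text{pipelining}} = \frac{\text{Pipeline depth}}{1 + \text{Stall cycles per instruction}} \]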
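To get a feel for the second formula, the sketch below evaluates it for a few stall rates. It assumes an ideal pipelined CPI of 1 and perfectly balanced stages with no clock overhead; `pipeline_speedup` is an illustrative name.

```python
def pipeline_speedup(pipeline_depth, stall_cycles_per_instruction):
    """Speedup = pipeline depth / (1 + stall cycles per instruction)."""
    return pipeline_depth / (1.0 + stall_cycles_per_instruction)


for stalls in (0.0, 0.25, 1.0):
    print(f"depth 5, {stalls} stalls/instr -> {pipeline_speedup(5, stalls):.2f}x")
```

With no stalls the speedup equals the pipeline depth; one stall cycle per instruction on average halves it.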
Forwarding
DADD  R1, R2, R3
DSUB  R4, R1, R5
AND   R6, R1, R7
OR    R8, R1, R9
XOR   R10, R1, R11
[Figure: pipeline timing diagram, time in clock cycles; DADD, DSUB, and AND each pass through IM, REG, ALU, DM, REG.]
Forwarding
Before bypassing (CPI > 1), time in clock cycles ->
R1 <- R2 + R3    IF  ID  EX  MA  WB
R4 <- R1 + R5        IF  ID  ID  ID  ID  EX  MA  WB
                         IF  IF  IF  IF  ID  EX  MA  WB
Stalled stages: the repeated ID and IF.

After bypassing (CPI = 1), time in clock cycles ->
R1 <- R2 + R3    IF  ID  EX  MA  WB
R4 <- R1 + R5        IF  ID  EX  MA  WB
                         IF  ID  EX  MA  WB
Cost of Forwarding
- In longer pipelines?
- In multiple-issue pipelines?
- Have all the dependences been resolved?
Forwarding
Forwarding cannot solve all data dependence problems:
LD   R2, 4(R1)
ADD  R4, R2, R3
Time (clock cycles) ->
LD    IM  REG  ALU  DM   REG
ADD       IM   REG  ALU  DM   REG
The loaded value is available only at the end of LD's DM stage, but ADD needs it at the start of its ALU stage, which falls in the same cycle.

Forwarding - Stall Condition
Time (clock cycles) ->
LD    IM  REG  ALU  DM   REG
ADD       IM   REG  REG  ALU  DM   REG
STALL: ADD repeats its REG stage for one cycle, after which the loaded value can be forwarded to the ALU.
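The stall condition can be expressed as a small timing calculation. The sketch below is a simplified model, assuming a five-stage pipeline with full forwarding, results that become available at the end of EX (ALU operations) or MEM (loads), and a consumer that needs its operand at the start of EX; the function and table names are illustrative.

```python
# Stage positions in the classic five-stage pipeline (assumed numbering).
IF, ID, EX, MEM, WB = range(5)

# Stage at whose end the producer's result can first be forwarded.
RESULT_READY_STAGE = {"alu": EX, "load": MEM}


def stalls_with_forwarding(producer_kind, distance):
    """Stall cycles for a consumer issued `distance` instructions after the producer.

    The producer enters IF in cycle 0, so it occupies stage s in cycle s.
    Without stalls, the consumer occupies EX in cycle distance + EX and needs
    its operand ready by the end of the cycle before that.
    """
    ready_cycle = RESULT_READY_STAGE[producer_kind]
    consumer_ex_cycle = distance + EX
    return max(0, ready_cycle - (consumer_ex_cycle - 1))


print(stalls_with_forwarding("alu", 1))    # ALU result used by the next instruction
print(stalls_with_forwarding("load", 1))   # LD R2 followed directly by ADD using R2
print(stalls_with_forwarding("load", 2))   # one independent instruction in between
```

In this model an ALU result never stalls its consumer, a load followed immediately by a dependent instruction costs one stall cycle, and placing one independent instruction between them removes the stall.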
Instruction Level Parallelism Static Scheduling
Outline
- ILP
- Multicycle instructions
- Loop unrolling, scheduling
- Superscalar pipelines
ILP
- Instruction-level parallelism: overlap among instructions, through pipelining or multiple instruction execution
- What determines the degree of ILP?
  - Dependences: a property of the program
  - Hazards: a property of the pipeline
Pipeline Scheduling
- Reorder instructions so that dependent instructions are far enough apart
- Done by the compiler, before the program runs: static instruction scheduling
- Done by the hardware, while the program is running: dynamic instruction scheduling
Static vs. Dynamic Scheduling
- Dynamic scheduling:
  - requires complex structures to identify independent instructions (scoreboards, issue queue)
  - high power consumption
  - low clock speed
  - high design and verification effort
- Static scheduling: the compiler can compute instruction latencies and dependences
Pipeline Scheduling
Original program (total execution cycles: 7):
LW    R3, 0(R1)
stall
ADDI  R5, R3, 1
ADD   R2, R2, R3
LW    R13, 0(R11)
stall
ADD   R12, R13, R3

Scheduled code (total execution cycles: 5):
LW    R3, 0(R1)
LW    R13, 0(R11)
ADDI  R5, R3, 1
ADD   R2, R2, R3
ADD   R12, R13, R3
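The cycle counts for the two versions can be reproduced with a simple counter. This sketch assumes that "total execution cycles" means one issue slot per instruction plus one stall whenever an instruction uses the result of the load immediately before it; `issue_cycles` and the tuple layout are illustrative.

```python
def issue_cycles(program):
    """Count issue slots: one per instruction, plus one per load-use stall.

    Each instruction is (opcode, destination register, source registers).
    """
    cycles = 0
    previous = None
    for opcode, dest, srcs in program:
        if previous is not None and previous[0] == "LW" and previous[1] in srcs:
            cycles += 1  # bubble between a load and its immediate consumer
        cycles += 1
        previous = (opcode, dest, srcs)
    return cycles


original = [
    ("LW", "R3", ("R1",)),
    ("ADDI", "R5", ("R3",)),
    ("ADD", "R2", ("R2", "R3")),
    ("LW", "R13", ("R11",)),
    ("ADD", "R12", ("R13", "R3")),
]

# Same instructions with the second load hoisted above the first consumer.
scheduled = [original[0], original[3], original[1], original[2], original[4]]

print(issue_cycles(original), issue_cycles(scheduled))
```

Hoisting the second load fills both load delay slots with independent work, which is where the two saved cycles come from.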
Why is Pipelining Hard to Implement? Interrupts, Exceptions, Traps, etc.
Outline
- Exception handling
- Precise and imprecise exceptions
- Exceptions in OoO pipelines
Exceptions
Events that request the attention of the processor
Stopping and Restarting Execution
- Trap instruction
- Turn off writes
- Save PC
- Save processor state
- (Disable exceptions)
- Exception handler
- RFE
- Precise exceptions

Pipeline stage | Problem exceptions occurring
IF             | Page fault on instruction fetch; misaligned memory access; memory protection violation
ID             | Undefined or illegal opcode
EX             | Arithmetic exception
MEM            | Page fault on data fetch; misaligned memory access; memory protection violation
WB             | None
Precise Exception Handling LD
Precise Exceptions
LD    IF  ID  EX  MEM  WB
DADD      IF  ID  EX   MEM  WB
- Multiple exceptions in the same cycle
- Early exception by a later instruction
- Instruction status vector: check before commit
Precise Exceptions Instruction Status Vector: Check before commit
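The idea of checking a status vector before commit can be illustrated with a functional (not cycle-accurate) model. The sketch below assumes a five-stage pipeline with no stalls, one status slot per stage, and in-order commit; `commit_in_order`, `raise_cycle`, and the fault dictionary are hypothetical names chosen for the example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def raise_cycle(index, stage):
    """1-based clock cycle in which instruction `index` occupies `stage` (no stalls)."""
    return index + STAGES.index(stage) + 1


def commit_in_order(program, faults):
    """Commit instructions in program order, checking each status vector first.

    `faults` maps (instruction index, stage) to a cause. Returns the list of
    committed instructions and the first exception in program order, if any.
    """
    committed = []
    for index, name in enumerate(program):
        # Status vector: one slot per stage, filled as the instruction advances.
        status_vector = [faults.get((index, stage)) for stage in STAGES]
        for stage, cause in zip(STAGES, status_vector):
            if cause is not None:
                return committed, (name, stage, cause)
        committed.append(name)
    return committed, None


program = ["LD", "DADD"]
faults = {
    (0, "MEM"): "data page fault",         # LD faults late, in cycle 4
    (1, "IF"): "instruction page fault",   # DADD faults early, in cycle 2
}
committed, exception = commit_in_order(program, faults)
print(raise_cycle(1, "IF"), raise_cycle(0, "MEM"), exception)
```

Although DADD's fault is raised two cycles before LD's, deferring the check to commit means the LD exception is the one reported, and nothing after it has committed.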
Multi-cycle Operations Pipeline
Precise Exceptions
DIV.D  F0, F2, F4
ADD.D  F10, F10, F8
SUB.D  F12, F12, F14
- Out-of-order completion
- Can't ignore exceptions: virtual memory, IEEE 754
- Fast mode vs. slow mode with precise exceptions
- Store results of earlier operations in a buffer
  - History file, future file
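A history file can be pictured as an undo log for register writes. The following is a heavily simplified sketch, not the mechanism of any particular processor: it assumes every register write logs the previous value and that an exception in an earlier instruction undoes all logged writes. The class name, register values, and `rollback` interface are invented for illustration.

```python
class HistoryFile:
    """Simplified history file: remembers the old value of each register write."""

    def __init__(self, registers):
        self.registers = dict(registers)
        self.log = []  # (register name, value before the write), oldest first

    def write(self, register, value):
        self.log.append((register, self.registers[register]))
        self.registers[register] = value

    def rollback(self, keep=0):
        """Undo writes, newest first, until only `keep` logged writes remain."""
        while len(self.log) > keep:
            register, old_value = self.log.pop()
            self.registers[register] = old_value


# Made-up initial values; DIV.D F0, F2, F4 is assumed to be still executing.
regs = HistoryFile({"F0": 0.0, "F8": 1.0, "F10": 2.0, "F12": 7.0, "F14": 3.0})

# ADD.D and SUB.D complete out of order and overwrite their own source operands.
regs.write("F10", regs.registers["F10"] + regs.registers["F8"])
regs.write("F12", regs.registers["F12"] - regs.registers["F14"])
after_early_completion = dict(regs.registers)

# DIV.D now raises an exception: restore the state from before the later writes.
regs.rollback()
print(after_early_completion["F10"], regs.registers["F10"])
```

A future file takes the opposite approach, keeping the speculative values in a separate copy and updating the architectural registers only in program order.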
References
- HP5e, Appendix C: Pipelining: Basic and Intermediate Concepts.
- HP5e, Chapter 3: Instruction-Level Parallelism and Its Exploitation.
- Smith and Pleszkun, "Implementing Precise Interrupts in Pipelined Processors," IEEE Transactions on Computers, 1988.