Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Neil Robertson
5 years ago
1 Instruction Level Parallelism Appendix C and Chapter 3, HP5e

2 Outline: Pipelining, Hazards; Branch prediction; Static and Dynamic Scheduling; Speculation; Compiler techniques, VLIW; Limits of ILP.

3 Implementation of RISC ISA - Stages: Instruction Fetch (IF); Instruction Decode/Register Fetch (ID), with fixed-field decoding; Execution/Effective Address (EX); Memory Access (MEM); Write Back (WB).

4 MIPS Datapath [Figure: five-stage MIPS datapath. Stages: Instruction Fetch (IF), Instruction Decode/Register Fetch (ID), Execute/Address Calculation (EX), Memory Access (MEM), Write Back (WB).]

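5 Multiple Issue Integer Pipeline [Figure: multiple-issue integer pipeline with two instruction registers (IR0, IR1), instruction memory (IM), register file read and write ports, and data memory (DM), across stages IF, ID, EX, MEM, WB.]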

6 Pipeline Performance: An unpipelined processor has a 1 ns clock cycle. ALU operations and branches take 4 cycles and memory operations take 5 cycles. The relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that, due to clock skew and setup, pipelining adds 0.2 ns of overhead to the clock. What is the speedup?
$\text{Average instruction execution time} = \text{Clock cycle} \times \text{Average CPI}$
$\mathrm{CPI} = \sum_{i=1}^{n} \frac{IC_i}{\text{Instruction Count}} \times \mathrm{CPI}_i$
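The question on this slide can be worked through numerically. The sketch below uses the slide's numbers; the one assumption not stated on the slide is that the pipelined processor completes one instruction per cycle (CPI = 1), so its average instruction time is just the lengthened clock cycle. The names `MIX`, `average_cpi`, and `pipelining_speedup` are illustrative only.

```python
# Pipeline speedup for the slide's example.
# Assumption (not on the slide): the pipelined machine has an ideal CPI of 1.

CLOCK_UNPIPELINED_NS = 1.0   # unpipelined clock cycle
PIPELINE_OVERHEAD_NS = 0.2   # extra clock time from skew and setup

# instruction class -> (relative frequency, cycles when unpipelined)
MIX = {
    "alu": (0.40, 4),
    "branch": (0.20, 4),
    "memory": (0.40, 5),
}


def average_cpi(mix):
    """Weighted CPI: sum over classes of frequency * cycles."""
    return sum(freq * cycles for freq, cycles in mix.values())


def pipelining_speedup(mix, clock_ns, overhead_ns):
    """Ratio of unpipelined to pipelined average instruction time."""
    unpipelined_time = clock_ns * average_cpi(mix)
    pipelined_time = (clock_ns + overhead_ns) * 1.0  # CPI assumed to be 1
    return unpipelined_time / pipelined_time


if __name__ == "__main__":
    cpi = average_cpi(MIX)
    speedup = pipelining_speedup(MIX, CLOCK_UNPIPELINED_NS, PIPELINE_OVERHEAD_NS)
    print(f"average CPI = {cpi:.2f}, speedup = {speedup:.2f}")
```

Under that assumption the unpipelined average instruction time is 4.4 ns, the pipelined time is 1.2 ns, and the speedup is about 3.67.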

7 Dependences Pipeline Hazards Structural & Data

8 Outline: Data dependences; Name dependences; Structural hazards; Data hazards; Stalling, Forwarding.

9 Basic Block: A straight-line code sequence with no branches in except at the entry and no branches out except at the exit.
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

10 Dependence
for (i=0; i<=999; i=i+1) x[i] = x[i] + a;
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop
Data dependence (RAW). Name dependences (WAR, WAW); name dependences can be removed by register renaming. Example in which both instructions write F4:
      ADD.D  F4, F0, F2
      ADD.D  F4, F6, F8
Hazard: Overlap during execution could change the order of access to the operand involved in the dependence.
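The three dependence types named on this slide can be told apart by comparing the registers two instructions read and write. The following is a minimal sketch, not a full dependence analyzer: it looks at one pair of instructions in program order, ignores memory dependences and any instructions in between, and the `Instr` record and `classify` function are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                 # assembly text, for display only
    dest: Optional[str]       # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]     # registers read


def classify(first: Instr, second: Instr) -> List[str]:
    """Register dependences of `second` on `first` (first is earlier in program order)."""
    kinds = []
    if first.dest is not None and first.dest in second.srcs:
        kinds.append("RAW")   # true data dependence: second reads what first writes
    if second.dest is not None and second.dest in first.srcs:
        kinds.append("WAR")   # antidependence: second overwrites what first reads
    if first.dest is not None and first.dest == second.dest:
        kinds.append("WAW")   # output dependence: both write the same register
    return kinds


load = Instr("L.D F0, 0(R1)", "F0", ("R1",))
add = Instr("ADD.D F4, F0, F2", "F4", ("F0", "F2"))
add2 = Instr("ADD.D F4, F6, F8", "F4", ("F6", "F8"))
store = Instr("S.D F4, 0(R1)", None, ("F4", "R1"))
bump = Instr("DADDUI R1, R1, #-8", "R1", ("R1",))

if __name__ == "__main__":
    for a, b in [(load, add), (add, add2), (store, bump)]:
        print(f"{a.text} -> {b.text}: {classify(a, b)}")
```

For the slide's instructions, the load/add pair is a RAW dependence on F0, the two ADD.D instructions have a WAW name dependence on F4, and the store followed by DADDUI has a WAR name dependence on R1.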

11 Hazards. Program order: ILP preserves program order only where it affects the outcome of the program. Structural hazards: resource conflicts. Data hazards: RAW, WAW, WAR. Control hazards: whether or not an instruction should be executed depends on a control decision made by an earlier instruction.

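12 Structural Hazard [Figure: pipeline timing diagram for instructions i1 to i5, each passing through MEM (instruction fetch), ID, EX, MEM, WB; the hazard is marked where two instructions need memory in the same cycle.] Examples: unified memory; register file accessed in both WB and ID.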

13 Cost of a Load Structural Hazard: Data references constitute 40% of the instruction mix. Ideal CPI = 1 (with no structural hazards). Assume that the processor with the structural hazard has a clock rate that is 1.1 times higher than the clock rate of the processor without the hazard. Which processor is faster, and by how much?
$\text{Avg. instruction time} = \mathrm{CPI} \times \text{Clock cycle time}$
$\text{Avg. instruction time}_{\text{ideal}} = \mathrm{CPI} \times \text{Clock cycle time}_{\text{ideal}}$

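14 Cost of a Load Structural Hazard
$\text{Avg. instruction time} = \mathrm{CPI} \times \text{Clock cycle time}$
$\text{Avg. instruction time} = (1 + 0.4 \times 1) \times \frac{\text{Clock cycle time}_{\text{ideal}}}{1.1}$
$\text{Avg. instruction time} = 1.27 \times \text{Clock cycle time}_{\text{ideal}}$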
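The comparison above can be checked with a few lines of arithmetic. This sketch assumes, consistently with the 1.27 result on the slide, that every data reference costs exactly one stall cycle on the processor with the hazard; the ideal clock cycle time is normalized to 1 and all names are illustrative.

```python
# Cost of a load structural hazard, with the ideal clock cycle normalized to 1.
# Assumption: each data reference (40% of instructions) stalls for one cycle.

DATA_REFERENCE_FRACTION = 0.4
STALL_CYCLES_PER_REFERENCE = 1
CLOCK_RATE_ADVANTAGE = 1.1      # hazard machine's clock rate relative to ideal


def avg_instruction_time(cpi, clock_cycle_time):
    """Average instruction time = CPI * clock cycle time."""
    return cpi * clock_cycle_time


ideal_time = avg_instruction_time(1.0, 1.0)

hazard_cpi = 1.0 + DATA_REFERENCE_FRACTION * STALL_CYCLES_PER_REFERENCE
hazard_time = avg_instruction_time(hazard_cpi, 1.0 / CLOCK_RATE_ADVANTAGE)

if __name__ == "__main__":
    print(f"ideal: {ideal_time:.2f}, with hazard: {hazard_time:.2f}")
    print(f"ratio: {hazard_time / ideal_time:.2f}")
```

Under these assumptions the processor without the structural hazard is faster, by a factor of about 1.27, even though its clock is slower.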

15 Data Hazards: R1 is updated in the WB stage. Example: R1 ← R2 + R3; R4 ← R1 + R5. [Figure: MIPS datapath with the two instructions in consecutive pipeline stages.]

16 Data Hazard: R1 ← R2 + R3; R4 ← R1 + R5. [Figure: pipeline timing diagram (time in clock cycles; stages IF, ID, EX, MA, WB) for consecutive instructions.] Wrong register values! How to overcome this hazard?

17 Stalled Stages and Pipeline Bubbles: R1 ← R2 + R3; R4 ← R1 + R5. [Figure: pipeline timing diagram (time in clock cycles; stages IF, ID, EX, MA, WB) showing the dependent instruction held in ID, the following instruction held in IF, and nop bubbles passing through the later stages.]

18 Resolving Data Hazards: Stalling one of the instructions; Data forwarding (bypassing); Scheduling hazardous instructions away from each other.

19 Stalling (Interlocking) [Figure: MIPS datapath with stall-condition logic that inserts a NOP into the pipeline. Example: R1 ← R2 + R3; R4 ← R1 + ...]

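20 Pipeline Performance
$\text{Speedup}_{\text{pipelining}} = \frac{\mathrm{CPI}_{\text{unpipelined}}}{\mathrm{CPI}_{\text{pipelined}}}$
$\text{Speedup}_{\text{pipelining}} = \frac{\text{Pipeline depth}}{1 + \text{Stall cycles per instruction}}$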
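To get a feel for the second formula, it can be evaluated for a few stall rates. The depth of 5 matches the five-stage pipeline used in these slides; the stall rates are made-up illustrative values, and `pipeline_speedup` is a hypothetical helper name.

```python
def pipeline_speedup(pipeline_depth, stall_cycles_per_instruction):
    """Speedup = pipeline depth / (1 + stall cycles per instruction)."""
    return pipeline_depth / (1.0 + stall_cycles_per_instruction)


if __name__ == "__main__":
    # Illustrative stall rates only; they are not taken from the slides.
    for stalls in (0.0, 0.2, 0.4, 1.0):
        print(f"stalls/instr = {stalls:.1f} -> speedup = {pipeline_speedup(5, stalls):.2f}")
```

With no stalls the speedup equals the pipeline depth; one stall cycle per instruction on average halves it.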

21 Forwarding
DADD R1, R2, R3
DSUB R4, R1, R5
AND  R6, R1, R7
OR   R8, R1, R9
XOR  R10, R1, R11
[Figure: pipeline timing diagram (time in clock cycles); each instruction passes through IM, REG, ALU, DM, REG.]

22 Forwarding. Before bypassing: R1 ← R2 + R3; R4 ← R1 + R5 runs with stalled stages, so CPI > 1. After bypassing: the same sequence runs without stalls, so CPI = 1. [Figure: two pipeline timing diagrams (time in clock cycles; stages IF, ID, EX, MA, WB), before and after bypassing.]

23 Cost of Forwarding: In longer pipelines? In multiple-issue pipelines? Have all the dependences been resolved?

24 Forwarding: Forwarding cannot solve all data dependence problems.
LD  R2, 4(R1)
ADD R4, R2, R3

25 Forwarding: Forwarding cannot solve all data dependence problems.
LD  R2, 4(R1)
ADD R4, R2, R3
[Figure: pipeline timing diagram (time in clock cycles); LD and ADD each pass through IM, REG, ALU, DM, REG.]

26 Forwarding - Stall Condition: Forwarding cannot solve all data dependence problems.
LD  R2, 4(R1)
ADD R4, R2, R3
[Figure: pipeline timing diagram (time in clock cycles); LD passes through IM, REG, ALU, DM, REG, while ADD is stalled before it proceeds.]

27 Instruction Level Parallelism Static Scheduling

28 Outline: ILP; Multicycle instructions; Loop unrolling, scheduling; Superscalar pipelines.

29 ILP: Instruction-level parallelism is overlap among instructions, through pipelining or multiple instruction execution. What determines the degree of ILP? Dependences: a property of the program. Hazards: a property of the pipeline.

30 Pipeline Scheduling: Reorder instructions so that dependent instructions are far enough apart. Done by the compiler, before the program runs: static instruction scheduling. Done by the hardware, when the program is running: dynamic instruction scheduling.

31 Static vs. Dynamic Scheduling. Dynamic scheduling requires complex structures to identify independent instructions (scoreboards, issue queue): high power consumption, low clock speed, high design and verification effort. Static scheduling: the compiler can compute instruction latencies and dependences.

32 Pipeline Scheduling
Original program (total execution cycles: 7):
LW   R3, 0(R1)
stall
ADDI R5, R3, 1
ADD  R2, R2, R3
LW   R13, 0(R11)
stall
ADD  R12, R13, R3
Scheduled code (total execution cycles: 5):
LW   R3, 0(R1)
LW   R13, 0(R11)
ADDI R5, R3, 1
ADD  R2, R2, R3
ADD  R12, R13, R3
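The cycle counts on this slide can be reproduced with a small model. The sketch below assumes a forwarding pipeline in which the only remaining hazard is the load-use case: an instruction that reads a register loaded by the immediately preceding LW costs one stall cycle. Like the slide, it counts issue slots and ignores pipeline fill and drain. `Instr` and `execution_cycles` are hypothetical names.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str                  # opcode mnemonic
    dest: str                # register written
    srcs: Tuple[str, ...]    # registers read


def execution_cycles(program: List[Instr]) -> int:
    """Count one cycle per instruction plus one stall per load-use pair."""
    cycles = 0
    prev = None
    for instr in program:
        # Load-use hazard: the previous instruction is a load whose result
        # this instruction needs, so forwarding alone is not enough.
        if prev is not None and prev.op == "LW" and prev.dest in instr.srcs:
            cycles += 1
        cycles += 1
        prev = instr
    return cycles


LW_R3 = Instr("LW", "R3", ("R1",))
ADDI_R5 = Instr("ADDI", "R5", ("R3",))
ADD_R2 = Instr("ADD", "R2", ("R2", "R3"))
LW_R13 = Instr("LW", "R13", ("R11",))
ADD_R12 = Instr("ADD", "R12", ("R13", "R3"))

ORIGINAL = [LW_R3, ADDI_R5, ADD_R2, LW_R13, ADD_R12]
SCHEDULED = [LW_R3, LW_R13, ADDI_R5, ADD_R2, ADD_R12]

if __name__ == "__main__":
    print("original:", execution_cycles(ORIGINAL))
    print("scheduled:", execution_cycles(SCHEDULED))
```

Moving the second load up separates each load from its first consumer, which removes both stalls: the model gives 7 cycles for the original order and 5 for the scheduled order, matching the slide.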

33 Why is Pipelining Hard to Implement? Interrupts, Exceptions, Traps, etc.

34 Outline: Exception handling; Precise and imprecise exceptions; Exceptions in OoO pipelines.

35 Exceptions: Events that request the attention of the processor.

36 Stopping and Restarting Execution: Trap instruction; Turn off writes; Save PC; Save processor state; (Disable exceptions); Exception handler; RFE. Precise exceptions.
Pipeline stage | Problem exceptions occurring
IF  | Page fault on instruction fetch; misaligned memory access; memory protection violation
ID  | Undefined or illegal opcode
EX  | Arithmetic exception
MEM | Page fault on data fetch; misaligned memory access; memory protection violation
WB  | None

37 Precise Exception Handling LD

38 Precise Exceptions
LD   IF ID EX MEM WB
DADD    IF ID EX MEM WB
Multiple exceptions in the same cycle. Early exception by a later instruction. Instruction status vector: check before commit.

39 Precise Exceptions Instruction Status Vector: Check before commit

40 Multi-cycle Operations Pipeline

41 Precise Exceptions
DIV.D F0, F2, F4
ADD.D F10, F10, F8
SUB.D F12, F12, F14
Out-of-order completion. Can't ignore exceptions: virtual memory, IEEE 754. Fast mode vs. slow mode with precise exceptions. Store results of earlier operations in a buffer: history file, future file.
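The out-of-order completion mentioned on this slide follows from the different execution latencies of the three instructions. The sketch below issues one instruction per cycle in program order and sorts by finish time. The latencies (24 cycles for DIV.D, 4 for ADD.D and SUB.D) are assumed values for illustration, not figures from the slides, and the model ignores structural conflicts such as contention for the write-back port.

```python
# Assumed execution latencies in cycles; illustrative values only.
EX_LATENCY = {"DIV.D": 24, "ADD.D": 4, "SUB.D": 4}

PROGRAM = [
    "DIV.D F0, F2, F4",
    "ADD.D F10, F10, F8",
    "SUB.D F12, F12, F14",
]


def completion_order(program, latency):
    """Issue one instruction per cycle in order; return them sorted by finish cycle."""
    finish = []
    for issue_cycle, text in enumerate(program, start=1):
        opcode = text.split()[0]
        finish.append((issue_cycle + latency[opcode], text))
    return [text for _, text in sorted(finish)]


if __name__ == "__main__":
    for text in completion_order(PROGRAM, EX_LATENCY):
        print(text)
```

With these latencies ADD.D and SUB.D complete before the earlier DIV.D. If DIV.D then raises an exception, the later instructions have already changed F10 and F12, which is why precise exceptions need results to be buffered (history file or future file) until earlier instructions are known to be safe.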

42 Outline: Exception handling; Precise and imprecise exceptions; Exceptions in OoO pipelines.

43 References: HP5e, Appendix C, Pipelining: Basic and Intermediate Concepts. HP5e, Chapter 3, Instruction-Level Parallelism and Its Exploitation. Smith and Pleszkun, "Implementing Precise Interrupts in Pipelined Processors," IEEE Transactions on Computers, 1988.
