CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

1 CS377P Programming for Performance
Single Thread Performance: Out-of-order Superscalar Pipelines
Sreepathi Pai, UTCS
September 14, 2015

2 Outline
1. Introduction
2. Out-of-order Scheduling
3. The Intel Haswell Pipeline

4 Recall: Compute-Bound Programs
- Goal: achieve maximum IPC
  - IPC: instructions per cycle
  - A property of the pipeline/CPU
- Bottlenecks
  - Fetch performance (branches/control hazards, i-cache misses)
  - Pipeline hazards (data dependencies, long-latency instructions)

5 The Out-of-Order Superscalar Processor
- Arguably the most successful high-performance processor design
  - Intel, AMD, PowerPC, and ARM in recent incarnations
  - Nearly all computers today use these processors
- Out-of-order (OoO) execution
  - Instructions execute when their data is ready
  - Similar to dataflow machines
- Superscalar execution
  - Multiple instructions can execute concurrently
  - Not specific to OoO, but required for performance
  - Implies IPC > 1
  - Requires instruction-level parallelism (ILP)

11 The Dependency Graph
    a = 1
    b = 2
    c = a + b
    d = b
    e = c + d
Note: arrows show dataflow (a → c, b → c, b → d, c → e, d → e)

12 Scheduling from the Dependency Graph
    a = 1
    b = 2
    c = a + b
    d = b
    e = c + d
- Identify nodes with no incoming arrows
- Schedule them
- Remove their outgoing arrows
- Repeat
Resulting schedule (instructions on the same line, separated by ":", are scheduled together):
    a = 1 : b = 2
    c = a + b : d = b
    e = c + d
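The four steps above can be sketched in a few lines of Python. This is only an illustration, not how hardware implements scheduling: the `(dest, sources)` tuple encoding and the `schedule` function are hypothetical names, and the sketch tracks only true dataflow dependencies, assuming every variable is written once.

```python
def schedule(instrs):
    """Group instructions into levels that could be scheduled together.

    instrs: list of (dest, [sources]) tuples in program order (hypothetical encoding).
    Only read-after-write (dataflow) edges are tracked.
    """
    last_writer = {}                      # variable -> index of the instruction producing it
    preds = {i: set() for i in range(len(instrs))}
    for i, (dest, srcs) in enumerate(instrs):
        for s in srcs:
            if s in last_writer:
                preds[i].add(last_writer[s])   # incoming arrow from the producer
        last_writer[dest] = i

    remaining = set(range(len(instrs)))
    groups = []
    while remaining:
        # Nodes whose producers have all been scheduled have no incoming arrows left.
        ready = sorted(i for i in remaining if not (preds[i] & remaining))
        groups.append(ready)
        remaining -= set(ready)           # equivalent to removing their outgoing arrows
    return groups


# a = 1; b = 2; c = a + b; d = b; e = c + d
program = [("a", []), ("b", []), ("c", ["a", "b"]), ("d", ["b"]), ("e", ["c", "d"])]
groups = schedule(program)
print(groups)
```

For the slide's program this yields three groups, matching the schedule shown on the slide: the two constants first, then `c` and `d`, then `e`.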

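13 Name Dependencies
    a = 1
    b = 2
    c = a + b
    a = 4
    d = a + b
The second write to a (a = 4) reuses a name: no data flows from c = a + b to it, yet it cannot be moved ahead of c = a + b without changing the value c reads.
[Figure: dependency graph over the nodes b = 2, a = 4, c = a + b, a = 1, d = a + b]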

14 Name Dependencies (After Renaming)
Original:
    a = 1
    b = 2
    c = a + b
    a = 4
    d = a + b
After renaming:
    a = 1
    b = 2
    a' = 4
    c = a + b
    d = a' + b
- Only registers get renamed
- ISA registers get mapped to physical registers
- Internal table keeps track of assignment
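A minimal sketch of the renaming idea, assuming an unlimited supply of physical registers that are never freed (real processors have a finite pool and recycle registers). The `rename` function, the tuple encoding, and the `p0`, `p1`, ... names are hypothetical, chosen only for illustration.

```python
from itertools import count


def rename(instrs):
    """Map ISA register names to fresh physical registers on every write.

    instrs: list of (dest, [sources]) tuples in program order (hypothetical encoding).
    """
    fresh = count()
    table = {}                                    # ISA register -> current physical register
    renamed = []
    for dest, srcs in instrs:
        phys_srcs = [table[s] for s in srcs]      # sources read the mapping before it changes
        table[dest] = f"p{next(fresh)}"           # every write gets a new physical register
        renamed.append((table[dest], phys_srcs))
    return renamed


# a = 1; b = 2; c = a + b; a = 4; d = a + b
program = [("a", []), ("b", []), ("c", ["a", "b"]), ("a", []), ("d", ["a", "b"])]
renamed = rename(program)
for dest, srcs in renamed:
    print(dest, "<-", srcs)
```

After renaming, the second write to `a` lands in its own physical register (`p3` here, playing the role of `a'`), so it no longer conflicts with the earlier read of `a` by `c = a + b`.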

15 Memory Aliasing
    a[i] = 1
    x = a[j]
    a[j] = x + 1
Whether x = a[j] depends on a[i] = 1 is known only once the values of i and j are known.
[Figure: two dependency graphs over the same three instructions]
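The effect of aliasing on reordering can be checked directly. The sketch below is a hypothetical experiment, not part of the original slides: it runs the three instructions in program order and with the load hoisted above the first store, then compares the results for distinct and for equal indices.

```python
def run(order, i, j):
    """Execute the three memory operations in the given order on a 2-element array."""
    a = [0, 0]
    x = None
    for step in order:
        if step == "store_i":        # a[i] = 1
            a[i] = 1
        elif step == "load_j":       # x = a[j]
            x = a[j]
        elif step == "store_j":      # a[j] = x + 1
            a[j] = x + 1
    return a, x


program_order = ["store_i", "load_j", "store_j"]
hoisted_load = ["load_j", "store_i", "store_j"]   # load moved above the first store

# i != j: the load does not depend on the store, so both orders agree.
same_when_distinct = run(program_order, 0, 1) == run(hoisted_load, 0, 1)
# i == j: the load must see the stored value, so hoisting it changes the result.
same_when_aliased = run(program_order, 0, 0) == run(hoisted_load, 0, 0)
print(same_when_distinct, same_when_aliased)
```

The reordering is safe only when the two addresses differ, which is why the scheduler needs memory addresses before it can treat a load and an earlier store as independent.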

16 Branches
    a = 1
    b = p[i]
    c = a > b
    if c then
        b = b + 1
    else
        b = 2 * b
    endif
    d = b
[Figure: dependency graph with nodes a = 1, b = p[i], and c = a > b?]

17 Basic Blocks
- A basic block is a sequence of instructions with one entry and one exit
- Instructions within a basic block can be reordered freely
  - Reordering need only respect dependencies
- Is OoO scheduling only within basic blocks good enough?

18 Basic Block Statistics: SPEC CPU INT 2006
Columns: Benchmark, Avg. BB size (inst)
Benchmarks: 400.perlbench, bzip, gobmk, hmmer, sjeng, libquantum, h264ref, omnetpp, astar, xalancbmk, mcf, gcc 7.14
Source: Ganesan et al., Synthesizing Memory-level Parallelism Aware Miniature Clones for SPEC CPU2006 and ImplantBench, ISPASS 2010

19 Basic Block Statistics: SPEC CPU FP 2006
Columns: Benchmark, Avg. BB size (inst)
Benchmarks: 410.bwaves, milc, zeusmp, gromacs, cactusADM, leslie3d, namd, soplex, GemsFDTD, sphinx
Source: Ganesan et al., Synthesizing Memory-level Parallelism Aware Miniature Clones for SPEC CPU2006 and ImplantBench, ISPASS 2010

20 Beyond Basic Block Scheduling
- Most OoO processors have a lot more than 4 instructions in flight
  - MIPS R10K had 32 instructions
  - Intel Haswell has 192 micro-ops
- Must break control dependences
  - Predict branches
  - Speculatively execute instructions beyond the branch
  - Will encounter multiple branches

21 Branch Prediction
- Branch prediction is needed for:
  - Conditional branches (loops, if-then)
  - Indirect branches #1 (switch, computed GOTOs)
  - Indirect branches #2 (function pointers)
  - Indirect branches #3 (return)
- Static branch prediction
  - Used the first time a branch is seen
  - Default behaviour, e.g. backward branches are always taken
  - Affects code layout
- Dynamic branch prediction
  - Based on runtime history
  - Table indexed by branch program counter
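To make "table indexed by branch program counter" concrete, here is a sketch of one classic dynamic scheme, a table of 2-bit saturating counters. The slides do not say which predictor is meant, so the choice of 2-bit counters, the table size, the weakly-taken initial state, and all names below are assumptions for illustration only.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by branch PC (illustrative sketch)."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [2] * entries          # assumption: start in the "weakly taken" state

    def _index(self, pc):
        return pc % self.entries            # simple hash of the branch program counter

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2   # counter values 2 and 3 predict "taken"

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def count_mispredictions(predictor, pc, outcomes):
    """Feed a sequence of actual outcomes for one branch; count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)         # runtime history feeds back into the table
    return misses


# A loop branch: taken 9 times, then not taken on exit; the loop is run 3 times.
loop_branch = [True] * 9 + [False]
misses = count_mispredictions(TwoBitPredictor(), 0x400, loop_branch * 3)
print(misses)
```

In this sketch the loop-closing branch is mispredicted only on each loop exit, because a single not-taken outcome is not enough to flip a saturated counter.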

22 Speculation
- Speculative execution implies one or more branch outcomes are unknown
  - Processor state is checkpointed at every prediction
- On a correct prediction
  - Dependent instructions are made non-speculative
  - Checkpoint is discarded
- On a misprediction
  - Throw away all branch-dependent instructions
  - Roll back state to the checkpoint before the mispredicted branch
  - Expensive
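The checkpoint-and-rollback behaviour can be sketched with the if/else example from the Branches slide. This is a software analogy under simplifying assumptions: the whole register state is copied as the checkpoint, a single branch is in flight, and `run_branch` and the path encoding are hypothetical names.

```python
def run_branch(regs, predicted_taken, then_path, else_path):
    """Speculate down the predicted path; roll back to the checkpoint on a misprediction.

    regs must contain the branch condition under key "c" (assumption of this sketch).
    Returns (final register state, whether speculative work was thrown away).
    """
    checkpoint = dict(regs)                       # state checkpointed at the prediction
    spec = dict(regs)
    for dest, op in (then_path if predicted_taken else else_path):
        spec[dest] = op(spec)                     # speculative execution beyond the branch

    actual_taken = bool(regs["c"])                # branch outcome becomes known
    if actual_taken == predicted_taken:
        return spec, False                        # correct: checkpoint is discarded

    recovered = dict(checkpoint)                  # misprediction: roll back state
    for dest, op in (then_path if actual_taken else else_path):
        recovered[dest] = op(recovered)           # re-execute down the correct path
    return recovered, True


# if c then b = b + 1 else b = 2 * b endif; d = b
then_path = [("b", lambda r: r["b"] + 1), ("d", lambda r: r["b"])]
else_path = [("b", lambda r: 2 * r["b"]), ("d", lambda r: r["b"])]

regs = {"a": 1, "b": 5}
regs["c"] = regs["a"] > regs["b"]                 # False, so the else path is correct
final, squashed = run_branch(regs, True, then_path, else_path)
print(final["b"], final["d"], squashed)
```

Predicting "taken" here is wrong, so the speculative `b = b + 1` is discarded and the else path runs from the restored state. The wasted speculative work plus the re-execution is what makes mispredictions expensive.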

23 In-order Retirement
- Instructions must appear to finish in order
- Reorder buffer
  - Queue of instructions
  - Entry allocated when instruction is fetched
  - Entry released when instruction retires
  - Buffers speculative instructions until the branch is resolved
- Retirement only occurs when
  - an instruction and its predecessors are no longer speculative
  - all prior instructions have retired
  - Note: "predecessors" and "prior" are in fetch order
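A reorder buffer can be modelled as a queue in fetch order whose head blocks retirement. The sketch below is a simplified illustration: the `ReorderBuffer` class and its `done` flag (meaning "completed and no longer speculative") are hypothetical, and capacity limits are ignored.

```python
from collections import deque


class ReorderBuffer:
    """Queue of in-flight instructions in fetch order (illustrative sketch)."""

    def __init__(self):
        self.entries = deque()

    def allocate(self, name):
        # Entry allocated when the instruction is fetched.
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def retire(self):
        # Release entries from the head only; a stalled head blocks everything behind it.
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired


rob = ReorderBuffer()
load = rob.allocate("load")
add = rob.allocate("add")
mul = rob.allocate("mul")

# The younger instructions finish first (out of order) ...
add["done"] = True
mul["done"] = True
first = rob.retire()        # ... but nothing retires while the older load is pending

load["done"] = True
second = rob.retire()       # now all three retire, in fetch order
print(first, second)
```

Execution finishes out of order, yet the retirement order is always the fetch order, which is what makes instructions appear to finish in order.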

24 Summary
- Identify independent instructions in the instruction stream
  - Requires renaming of registers
  - Requires memory addresses
  - Requires branch outcomes
    - Predict branches if necessary
- Issue independent instructions to functional units
- As instructions complete, issue waiting instructions whose data is now ready
- As branch outcomes become known, terminate speculative execution
  - May involve throwing away wrong-path instructions

26 The Intel Haswell Pipeline
[Figure: pipeline diagram, from the Intel 64 and IA-32 Architectures Optimization Reference Manual]

27 Instruction Fetch and Decode
- Intel processors have a complex instruction encoding
  - Lengths vary
  - Operands may be registers or memory
  - CISC-style
- First step: convert to RISC-like micro-ops
  - Micro-ops are like instructions
  - Not visible to the programmer
  - More uniform than the Intel ISA
- Pipeline executes micro-ops

28 Scheduler
- 8-wide issue (micro-ops)
  - 4 ALU
  - 2 Loads
  - 3 Store Address calculations (STA) [overlapped with 2 loads]
  - 1 Store Data (STD)
  - 2 Branch Execution units
- Floating point
  - 2 FMA
  - 2 FP MUL
- SIMD
  - 3 SIMD
- If a port is not available, instructions will wait even if independent
- Instruction ordering is important!
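The point that independent instructions still wait for a free port can be illustrated with a toy issue model. This is a deliberate simplification and not a model of the real Haswell port map: it treats each unit type as its own pool, uses only the ALU and load counts from the slide, and `cycles_to_issue` is a hypothetical helper.

```python
def cycles_to_issue(uops, ports, width=8):
    """Count cycles needed to issue mutually independent micro-ops.

    uops:  list of unit-type strings, oldest first (hypothetical encoding).
    ports: dict mapping unit type -> number of units usable per cycle.
    width: maximum micro-ops issued per cycle.
    """
    for kind in uops:
        if ports.get(kind, 0) < 1:
            raise ValueError(f"no port can execute {kind!r}")

    pending = list(uops)
    cycles = 0
    while pending:
        free = dict(ports)                  # all ports become free again each cycle
        issued = 0
        waiting = []
        for kind in pending:
            if issued < width and free[kind] > 0:
                free[kind] -= 1             # claim a port of the right type
                issued += 1
            else:
                waiting.append(kind)        # independent, but must wait for a port
        pending = waiting
        cycles += 1
    return cycles


ports = {"alu": 4, "load": 2}               # counts taken from the slide, simplified
balanced = cycles_to_issue(["alu"] * 4 + ["load"] * 2, ports)
load_heavy = cycles_to_issue(["load"] * 3 + ["alu"], ports)
print(balanced, load_heavy)
```

Six micro-ops spread across the units issue in one cycle, while four micro-ops that lean on the two load units need two, even though none of them depends on another. The mix of instructions matters, not just their number.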

29 Conclusion
- Out-of-order processors execute instructions when ready
  - Exploit instruction-level parallelism (ILP)
- No longer subject to head-of-line blocking
  - Long-latency instructions are okay if they are off the critical path
- Highly sensitive to branch behaviour
  - Processors use speculation to schedule beyond branches
  - Branches can be very expensive if mispredicted
