Yunsup Lee UC Berkeley 1
2 Why is Supporting Control Flow Challenging in Data-Parallel Architectures?

for (i=0; i<16; i++) {
  a[i] = op0;
  b[i] = op1;
  if (a[i] < b[i]) {
    c[i] = op2;
  } else {
    c[i] = op3;
  }
  d[i] = op4;
}

[Figure: a data-parallel issue unit driving 16 threads (Thread0-Thread15), which take different paths at the branch.]

The divergence management architecture must not only partially sequence all execution paths for correctness, but also reconverge threads from different execution paths for efficiency.
3 For GPUs, Supporting Complex Control Flow in a SPMD Program is not Optional!

Vectorized loop:

for (i=0; i<16; i++) {
  if (a[i] < b[i]) {
    if (f[i]) {
      c[i]->vfunc();
      goto SKIP;
    }
    d[i] = op;
  }
SKIP: ;
}

SPMD kernel:

Kernel() {
  if (a[tid] < b[tid]) {
    if (f[tid]) {
      c[tid]->vfunc();
      goto SKIP;
    }
    d[tid] = op;
  }
SKIP: ;
}
Kernel<<<16>>>();

Traditional vector compilers can always give up and run complex control flow on the control processor. For the SPMD compiler, however, supporting complex control flow is a functional requirement rather than an optional performance optimization.
4 Design Space of Divergence Management

Software (explicitly scheduled by the compiler):
- Predication (used in limited cases)
- Predication + fork/join
- Vector predication

Hardware (implicitly managed by the microarchitecture):
- Divergence stack (compiler figures out reconvergence points)

We can perform a design space exploration of divergence management on NVIDIA GPU silicon.
5 Executive Summary, Contributions of Paper

Setup: 28 benchmarks written in CUDA (Parboil, Rodinia, FFT, nqueens), compiled two ways and run on an NVIDIA Tesla K20c (Kepler, GK110) to collect performance and statistics:
- CUDA 6.5 production compiler: divergence stack + limited predication
- Modified CUDA 6.5 production compiler: full predication with new compiler algorithms

Contributions:
(1) Detailed explanation and categorization of hardware and software divergence management schemes
(2) SPMD predication compiler algorithms
(3) Apples-to-apples comparison using production silicon and compiler

Result: performance with predication is on par with performance with the divergence stack.
6 How do the hardware divergence stack and software predication handle control flow?
7 If-Then-Else Example: Divergence Stack

Compilation flow: CUDA Program -> LLVM Compiler -> PTX -> ptxas Backend Compiler -> SASS

Kernel() {
  a = op0;
  b = op1;
  if (a < b) {
    c = op2;
  } else {
    c = op3;
  }
  d = op4;
}
Kernel<<<n>>>();

PTX:

  a = op0
  b = op1
  p = slt a,b
  branch.eqz p, else
  c = op2
  j ipdom
else:
  c = op3
ipdom:
  d = op4

The LLVM compiler takes a CUDA program and generates PTX, which encodes all data/control dependences. Note that all the instructions above, despite being scalar instructions, are executed in a SIMD fashion.
8 If-Then-Else Example: Divergence Stack

SASS:

  a = op0
  b = op1
  p = slt a,b
  push ipdom
  branch.eqz p, else
  c = op2.pop
else:
  c = op3.pop
ipdom:
  d = op4

The ptxas backend compiler takes PTX and generates SASS instructions, which execute natively on the GPU. Reconvergence points are analyzed and inserted by the backend compiler.
9 If-Then-Else Example: Divergence Stack

Assume 4 threads are executing, and that threads 0 and 1 took the branch.

  Mask  Op
  1111  op0
  1111  op1
  1111  slt
  1111  push
  1111  branch
  1100  op2.pop
  0011  op3.pop
  1111  op4

Divergence stack after the branch (top of stack first):
  <PC = else,  MASK = 0011>
  <PC = ipdom, MASK = 1111>

Push: pushes <reconverge pc, current mask> onto the stack.
Pop: disregards the current <pc, mask>, pops the top of the stack, and executes the deferred <pc, mask>.
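The push/pop semantics above can be sketched as a small simulation. This is an illustrative model of the mechanism described on the slide, not actual GPU hardware; the labels and the `run_if_then_else` helper are invented for the example.

```python
# Minimal simulation of a SIMT divergence stack for the if-then-else
# example: 4 threads, threads 0 and 1 take the branch (a < b).
def run_if_then_else(take_branch):
    """take_branch[i] is True if thread i satisfies a < b."""
    n = len(take_branch)
    trace = []                          # (op, active mask) per executed block
    stack = []                          # entries: (pc_label, mask)
    mask = [True] * n                   # all threads start active

    # push ipdom: push <reconverge pc, current mask>
    stack.append(("ipdom", list(mask)))

    # branch.eqz p, else: split the warp, defer the else-path
    then_mask = [m and t for m, t in zip(mask, take_branch)]
    else_mask = [m and not t for m, t in zip(mask, take_branch)]
    stack.append(("else", else_mask))

    mask = then_mask
    trace.append(("op2", list(mask)))   # c = op2 (.pop)
    _label, mask = stack.pop()          # pop -> deferred <else, mask>
    trace.append(("op3", list(mask)))   # c = op3 (.pop)
    _label, mask = stack.pop()          # pop -> <ipdom, full mask>
    trace.append(("op4", list(mask)))   # d = op4: threads reconverged
    return trace

trace = run_if_then_else([True, True, False, False])
for op, mask in trace:
    print(op, "".join("1" if m else "0" for m in mask))
```

Running this reproduces the mask column of the slide: op2 under 1100, op3 under 0011, and op4 under the reconverged 1111.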
10 If-Then-Else Example: Predication

Kernel() {
  a = op0;
  b = op1;
  if (a < b) {
    c = op2;
  } else {
    c = op3;
  }
  d = op4;
}
Kernel<<<n>>>();

Predicated code:

  a = op0
  b = op1
  f0 = slt a,b
  @f0  c = op2
  @!f0 c = op3
  d = op4

The compiler can schedule instructions. Predicates also encode reconvergence information.
11 Uniform Branch Conditions Across All Threads

Divergence stack:

  a = op0
  b = op1
  p = slt a,b
  push ipdom
  branch.eqz p, else
  c = op2.pop
else:
  c = op3.pop
ipdom:
  d = op4

Predication:

  a = op0
  b = op1
  f0 = slt a,b
  @f0  c = op2
  @!f0 c = op3
  d = op4

What if the branch condition is uniform across all threads? With predication, instructions still execute, just with a null predicate. With the divergence stack, branches don't push a token onto the stack when the branch condition is uniform.
12 Runtime Branch-Uniformity Optimization with Consensual Branches

Kernel() {
  a = op0;
  b = op1;
  if (a < b) {
    c = op2;
  } else {
    c = op3;
  }
  d = op4;
}
Kernel<<<n>>>();

Thread-aware predication:

  a = op0
  b = op1
  f0 = slt a,b
  cbranch.ifnull f0, else
  @f0  c = op2
else:
  cbranch.ifnull !f0, ipdom
  @!f0 c = op3
ipdom:
  d = op4

We can optimize the predicated code with a consensual branch (cbranch), which is taken only when all threads consensually agree on the branch condition (ifnull). The code may jump around unnecessary work.
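The interaction between guard predicates and consensual branches can be sketched as a simulation. This is an illustrative model under assumed semantics (the `run_predicated` helper and the "op2"/"op3" result strings are invented for the example), not the compiler's actual output.

```python
# Sketch of thread-aware predicated execution with a runtime
# branch-uniformity check (the cbranch.ifnull consensual branch).
def run_predicated(a, b):
    n = len(a)
    c = [None] * n
    executed = []                        # blocks the warp actually issued

    f0 = [x < y for x, y in zip(a, b)]   # f0 = slt a, b

    # cbranch.ifnull f0, else: skip the then-block if no thread needs it
    if any(f0):
        executed.append("then")
        for i in range(n):
            if f0[i]:                    # @f0 c = op2
                c[i] = "op2"

    # cbranch.ifnull !f0, ipdom: skip the else-block if f0 is all-true
    if any(not p for p in f0):
        executed.append("else")
        for i in range(n):
            if not f0[i]:                # @!f0 c = op3
                c[i] = "op3"

    executed.append("ipdom")             # d = op4, unpredicated
    return c, executed

# Divergent warp: both sides issue.
print(run_predicated([1, 5], [3, 2]))
# Uniform warp: the else-block is consensually skipped.
print(run_predicated([1, 1], [3, 3]))
```

With a divergent warp both blocks issue; with a uniform branch condition the consensual branch jumps around the dead side entirely, which is exactly the work the optimization saves.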
13 Static Branch-Uniformity Optimization with Consensual Branches

Kernel() {
  a = op0;
  b = op1;
  if (a < b) {
    c = op2;
  } else {
    c = op3;
  }
  d = op4;
}
Kernel<<<n>>>();

Thread-aware predication:

  a = op0
  b = op1
  f0 = slt a,b
  cbranch.ifnull f0, else
  c = op2
else:
  cbranch.ifnull !f0, ipdom
  c = op3
ipdom:
  d = op4

If the compiler can prove that the branch condition is uniform across all threads, the compiler can omit the guard predicates.
14 Loop Example: Consensual Branches are Key to Compiling Loops with Predication

Kernel() {
  done = false;
  while (!done) {
    a = op0;
    b = op1;
    done = a < b;
  }
  c = op2;
}
Kernel<<<n>>>();

Thread-aware predication:

  f0 = true
loop:
  cbranch.ifnull f0, exit
  @f0 a = op0
  @f0 b = op1
  @f0 f1 = slt a, b
  f0 = and f0, !f1
  j loop
exit:
  c = op2

Intuitively, the compiler needs to sequence the loop until all threads are done executing it. A consensual branch (cbranch.ifnull) checks whether the loop mask (f0) is null.
15 Thread-Aware (TA) Predication Compiler Algorithms
16 Predication Compiler Algorithms

Steps:
(1) Generate CFG, CDG
(2) Walk CDG to get guard predicates for all BBs
(3) Linearize control flow
(4) Predicate all instructions with guard predicates and rewire all BBs

Kernel() {
  N1; N2;
  if (!P1) {
    N3;
  } else {
    N4;
    if (!P2) {
      N5;
    } else {
      N6;
    }
    N7;
  }
  N8;
}
Kernel<<<n>>>();

[Figure: the control flow graph (CFG) and control dependence graph (CDG) for the kernel. Example: the guard predicate of N5 is P1 && !P2.]
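Step (2) can be sketched for this example. The control-dependence terms below are hard-coded from the slide's figure; a real compiler derives them from the CFG via post-dominance analysis, and the `guard` helper and string representation are invented for the illustration.

```python
# Sketch of "walk the CDG to get guard predicates": each block's
# guard is the conjunction of (predicate, polarity) terms on its
# control-dependence path. Terms are hard-coded from the figure.
CDG_TERMS = {
    "N1": [], "N2": [], "N8": [],        # control-independent blocks
    "N3": [("P1", False)],               # runs when !P1
    "N4": [("P1", True)],                # else-side of the !P1 test
    "N7": [("P1", True)],
    "N5": [("P1", True), ("P2", False)], # nested: P1 && !P2
    "N6": [("P1", True), ("P2", True)],  # nested: P1 && P2
}

def guard(block):
    terms = CDG_TERMS[block]
    if not terms:
        return "true"                    # executes under the full mask
    return " && ".join(p if pos else "!" + p for p, pos in terms)

print(guard("N5"))   # matches the slide: P1 && !P2
```

With the guards in hand, linearization emits the blocks in a single sequence and predicates each block's instructions on its guard.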
17 Runtime Branch-Uniformity Optimization

Same kernel and compilation steps as the previous slide. Assume the compiler cannot prove that P1 is uniform across all threads.

[Figure: the same CFG and CDG. During linearization, the compiler adds a consensual branch that skips one region when P1 is null, and another that skips the opposite region when !P1 is null.]
18 Static Branch-Uniformity Optimization

Same kernel and compilation steps. Assume the compiler can prove that P1 is uniform across all threads.

[Figure: the same CFG and CDG. With P1 proven uniform, it can be dropped from the guard conjunctions, leaving a guard predicate of just P2.]
19 Predication Compiler Algorithms: Loops

Kernel() {
  if (!P1) {
    while (!P2) {
      if (!P3) {
        break;
      }
    }
  }
}
Kernel<<<n>>>();

[Figure: the CFG (blocks N1-N12) and the thread-aware control dependence graph, with a loop node L1 and exit nodes E1 and E2. A loop mask governs the loop region, and two exit masks track threads that leave via the P2 loop test and via the P3 break.]
20 Supporting Complex Control Flow

Function calls: supported by a straightforward calling convention.

Virtual functions:

With divergence stack:
  jalr r3

With predication:
loop:
  p0, r4 = find_unique p2, r3
  @p0 jalr r4
  p2 = p2 and !p0
  cbranch.ifany p2, loop

Irreducible control flow: find the smallest region containing the irreducible control flow, and insert sequencing code at the entry block and exit block to sequence active threads through the region one by one.
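The find_unique loop above can be sketched as a simulation: with active mask p2, each iteration picks one call target held by some active thread, issues the call for every active thread sharing that target, and repeats until p2 is null. This is an illustrative model of the slide's sequence; the `call_virtual` helper and target names are invented.

```python
# Sketch of the predicated virtual-call sequence: serialize the warp
# over the set of unique call targets among active threads.
def call_virtual(active, targets):
    p2 = list(active)                    # active mask
    calls = []                           # (target, mask) per jalr issued
    while any(p2):                       # cbranch.ifany p2, loop
        # p0, r4 = find_unique p2, r3: pick one target held by an
        # active thread; p0 marks all active threads sharing it
        r4 = next(t for i, t in enumerate(targets) if p2[i])
        p0 = [p2[i] and targets[i] == r4 for i in range(len(p2))]
        calls.append((r4, p0))           # @p0 jalr r4
        # p2 = p2 and !p0: retire the threads that just called
        p2 = [p2[i] and not p0[i] for i in range(len(p2))]
    return calls

calls = call_virtual([True] * 4, ["fA", "fB", "fA", "fC"])
# Three calls issue: fA for threads {0,2}, fB for {1}, fC for {3}.
```

The key property is that the loop issues one call per unique target rather than one per thread, so threads sharing a target still execute the callee together.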
21 Evaluation
22 Predication CUDA Compiler

Pipeline: CUDA Program -> LLVM Compiler -> Annotated PTX -> ptxas Backend Compiler -> SASS

LLVM compiler: runs the predication passes and generates a throw-away pseudo PTX instruction carrying the predication information.
ptxas backend compiler: runs its existing optimizations, then predicates using the information retrieved from the pseudo PTX instructions.

Theoretically we could have implemented our thread-aware predication pass in ptxas; we implemented the bulk of it in LLVM for fast prototyping.
23 Evaluation: Quick Recap

28 benchmarks written in CUDA (Parboil, Rodinia, FFT, nqueens), run on an NVIDIA Tesla K20c (Kepler, GK110). Two compilers produce the CUDA binaries: the CUDA 6.5 production compiler (divergence stack + limited predication) and the modified CUDA 6.5 production compiler (full predication with TA compiler algorithms). We compare performance and statistics across five bars:
1) Baseline (divergence stack)
2) Limited predication
3) TA predication
4) TA+SBU (static branch-uniformity optimization)
5) TA+SBU+RBU (runtime branch-uniformity optimization)
24 Performance Results: Geomean

[Chart: speedup over the divergence-stack baseline for each benchmark (Parboil "p-", Rodinia "r-", fft, nqueens) and the geomean, with bars for Divergence Stack+Limited Predication, TA, TA+SBU, and TA+SBU+RBU.]

- The thread-aware predication compiler is competitive with the baseline compiler (divergence stack).
- Both the static and runtime branch-uniformity optimizations play an important role.
- Performance doesn't change with the limited if-conversion heuristic implemented in the production compiler.
25 Performance Results: Speedups

[Chart: same configurations as the previous slide, highlighting benchmarks that speed up.]

- Predication can expose better scheduling opportunities.
- Extra consensual branches added for +RBU may act as scheduling barriers.
26 Performance Results: Slowdowns

[Chart: same configurations, highlighting benchmarks that slow down.]

- Ten benchmarks are inconclusive (>90%, <100%).
27 Performance Results: Slowdowns

[Chart: same configurations as the previous slide.]

- Ten benchmarks are inconclusive (>90%, <100%).
- Five benchmarks fall in the <90% range.
- An increase in register pressure can reduce occupancy, which sometimes reduces performance.
- The compiler is not able to optimize for all branch uniformity exhibited at runtime.
28 Discussion on Area, Power, Energy

It is hard to quantify the impact of predication on area, power, and energy, since the experiment was done on GPU silicon. The primary motivation of software divergence management is to reduce hardware design complexity and the associated verification costs; the area and power overheads of the divergence stack itself are not significant. Since performance tracks power/energy consumption, the power/energy consumption of software divergence management should be comparable to that of hardware divergence management.
29 Fundamental Advantages of Software Divergence Management

[Figure: control flow graph (blocks N1-N7) of a short-circuit example.]

In the short-circuit example, the divergence stack cannot reconverge threads at N4, while predication can reconverge threads at N4.
30 Conclusions

Advantages of the divergence stack:
- Enables a fairly conventional thread compilation model
- Makes register allocation easier
- Simplifies the task of supporting irreducible control flow

Advantages of predication:
- Simplifies the hardware without sacrificing programmability
- Actual cases where predication can outperform the divergence stack: better scheduling opportunities and better reconvergence of threads

For divergence management, pushing complexity to the compiler is a better choice.

Acknowledgments: This work was funded by DARPA award HR , the Center for Future Architecture Research, a member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, an NVIDIA graduate fellowship, and ASPIRE Lab industrial sponsors and affiliates Intel, Google, Nokia, NVIDIA, Oracle, and Samsung. It was also funded by DOE contract B . Any opinions, findings, conclusions, or recommendations in this paper are solely those of the authors and do not necessarily reflect the position or the policy of the sponsors.
Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures
Yunsup Lee, Vinod Grover, Ronny Krashinsky, Mark Stephenson, Stephen W. Keckler, Krste Asanović
University of California,
Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method Josh Romero, Massimiliano Fatica - NVIDIA Vamsi Spandan, Roberto Verzicco -
More informationHARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA
HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationCUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN
CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction Francesco Rossi University of Bologna and INFN * Using this terminology since you ve already heard of SIMD and SPMD at this school
More informationApple LLVM GPU Compiler: Embedded Dragons. Charu Chandrasekaran, Apple Marcello Maggioni, Apple
Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1 Agenda How Apple uses LLVM to build a GPU Compiler Factors that affect GPU performance The Apple GPU compiler
More informationProgrammable Graphics Hardware (GPU) A Primer
Programmable Graphics Hardware (GPU) A Primer Klaus Mueller Stony Brook University Computer Science Department Parallel Computing Explained video Parallel Computing Explained Any questions? Parallelism
More informationCSE 401/M501 Compilers
CSE 401/M501 Compilers Intermediate Representations Hal Perkins Autumn 2018 UW CSE 401/M501 Autumn 2018 G-1 Agenda Survey of Intermediate Representations Graphical Concrete/Abstract Syntax Trees (ASTs)
More informationPosition Paper: OpenMP scheduling on ARM big.little architecture
Position Paper: OpenMP scheduling on ARM big.little architecture Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, and Michel Robert LIRMM
More informationParallel Compact Roadmap Construction of 3D Virtual Environments on the GPU
Parallel Compact Roadmap Construction of 3D Virtual Environments on the GPU Avi Bleiweiss NVIDIA Corporation Programmability GPU Computing CUDA C++ Parallel Debug Heterogeneous Computing Productivity Efficiency
More informationImproving Performance of Machine Learning Workloads
Improving Performance of Machine Learning Workloads Dong Li Parallel Architecture, System, and Algorithm Lab Electrical Engineering and Computer Science School of Engineering University of California,
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationarxiv: v5 [cs.ar] 12 Feb 2017
A Scratchpad Sharing in GPUs Vishwesh Jatala, Indian Institute of Technology, Kanpur Jayvant Anantpur, Indian Institute of Science, Bangalore Amey Karkare, Indian Institute of Technology, Kanpur arxiv:1607.03238v5
More informationDesigning a Domain-specific Language to Simulate Particles. dan bailey
Designing a Domain-specific Language to Simulate Particles dan bailey Double Negative Largest Visual Effects studio in Europe Offices in London and Singapore Large and growing R & D team Squirt Fluid Solver
More informationTowards a Performance- Portable FFT Library for Heterogeneous Computing
Towards a Performance- Portable FFT Library for Heterogeneous Computing Carlo C. del Mundo*, Wu- chun Feng* *Dept. of ECE, Dept. of CS Virginia Tech Slides Updated: 5/19/2014 Forecast (Problem) AMD Radeon
More informationImplementing Control Flow Constructs Comp 412
COMP 412 FALL 2018 Implementing Control Flow Constructs Comp 412 source code IR Front End Optimizer Back End IR target code Copyright 2018, Keith D. Cooper & Linda Torczon, all rights reserved. Students
More informationDesign of Embedded DSP Processors Unit 2: Design basics. 9/11/2017 Unit 2 of TSEA H1 1
Design of Embedded DSP Processors Unit 2: Design basics 9/11/2017 Unit 2 of TSEA26-2017 H1 1 ASIP/ASIC design flow We need to have the flow in mind, so that we will know what we are talking about in later
More informationEC 413 Computer Organization
EC 413 Computer Organization Review I Prof. Michel A. Kinsy Computing: The Art of Abstraction Application Algorithm Programming Language Operating System/Virtual Machine Instruction Set Architecture (ISA)
More informationCUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav
CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture
More informationEfficient JIT to 32-bit Arches
Efficient JIT to 32-bit Arches Jiong Wang Linux Plumbers Conference Vancouver, Nov, 2018 1 Background ISA specification and impact on JIT compiler Default code-gen use 64-bit register, ALU64, JMP64 test_l4lb_noinline.c
More informationIntroduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel
More informationCS553 Lecture Dynamic Optimizations 2
Dynamic Optimizations Last time Predication and speculation Today Dynamic compilation CS553 Lecture Dynamic Optimizations 2 Motivation Limitations of static analysis Programs can have values and invariants
More informationGaaS Workload Characterization under NUMA Architecture for Virtualized GPU
GaaS Workload Characterization under NUMA Architecture for Virtualized GPU Huixiang Chen, Meng Wang, Yang Hu, Mingcong Song, Tao Li Presented by Huixiang Chen ISPASS 2017 April 24, 2017, Santa Rosa, California
More informationAutoMatch: An Automated Framework for Relative Performance Estimation and Workload Distribution on Heterogeneous HPC Systems
AutoMatch: An Automated Framework for Relative Performance Estimation and Workload Distribution on Heterogeneous HPC Systems Ahmed E. Helal, Wu-chun Feng, Changhee Jung, and Yasser Y. Hanafy Electrical
More informationTrace Compilation. Christian Wimmer September 2009
Trace Compilation Christian Wimmer cwimmer@uci.edu www.christianwimmer.at September 2009 Department of Computer Science University of California, Irvine Background Institute for System Software Johannes
More informationSudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread
Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)
More informationAutomatic Intra-Application Load Balancing for Heterogeneous Systems
Automatic Intra-Application Load Balancing for Heterogeneous Systems Michael Boyer, Shuai Che, and Kevin Skadron Department of Computer Science University of Virginia Jayanth Gummaraju and Nuwan Jayasena
More informationAn Automated Framework for Characterizing and Subsetting GPGPU Workloads
An Automated Framework for Characterizing and Subsetting GPGPU Workloads Vignesh Adhinarayanan and Wu-chun Feng Department of Computer Science, Virginia Tech Blacksburg, VA 24061, U.S.A. {avignesh, wfeng}@vt.edu
More informationTruffle A language implementation framework
Truffle A language implementation framework Boris Spasojević Senior Researcher VM Research Group, Oracle Labs Slides based on previous talks given by Christian Wimmer, Christian Humer and Matthias Grimmer.
More informationCS 432 Fall Mike Lam, Professor. Code Generation
CS 432 Fall 2015 Mike Lam, Professor Code Generation Compilers "Back end" Source code Tokens Syntax tree Machine code char data[20]; int main() { float x = 42.0; return 7; } 7f 45 4c 46 01 01 01 00 00
More information