A Large-Scale Cross-Architecture Evaluation of Thread-Coarsening
Alberto Magni, Christophe Dubach, Michael O'Boyle
Introduction: wide adoption of GPGPU for HPC; many GPU devices from many vendors: AMD, Nvidia, Intel, Qualcomm, ARM.
Introduction: OpenCL runs on devices from AMD, Nvidia, Intel, Qualcomm, and ARM.
OpenCL is Functionally Portable: each vendor (AMD, Nvidia, Intel, Qualcomm, ARM) ships a proprietary compiler for its own devices.
Performance Evaluation: regression trees for thread coarsening.
[Regression tree: Loads < 0.92 → 1.70; else Branches < 2.80 → (Cache Misses < 1.00 → 1.06; else 0.40); else 0.81]
Performance is NOT Portable [Figure: coarsening speedups of Nbody on AMD Cypress vs. Nvidia Fermi]
What's next: Motivation, Compiler Infrastructure, Thread-Coarsening Experiments, Data Analysis, Conclusions and Future Work
Thread Coarsening: merge threads, reducing the thread count (original thread space → transformed thread space) while increasing the amount of work per thread.
Advantages of Thread Coarsening: reduces the amount of redundant computation; supported by the standard and architecture-independent, which makes it well suited to a cross-architectural evaluation.
Our Portable Compiler
Thread Coarsening Implementation: an LLVM function pass replicates the instructions in the kernel body. Original kernel, executed here by threads id = 0 and id = 1:

  for index in (0 : width)
    tmp += A[id + index]
  B[id] = tmp

Thread Coarsening Implementation: identify the divergent instructions, i.e. those whose result depends on the thread id (here the accesses to A and B).

Thread Coarsening Implementation: replicate the divergent instructions. With a coarsening factor of 2, each remaining thread does the work of two original threads:

  for index in (0 : width)
    tmp1 += A[2*id + index]
    tmp2 += A[2*id + 1 + index]
  B[2*id] = tmp1
  B[2*id + 1] = tmp2

The thread number is reduced at runtime.
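The transformation above can be sanity-checked with a small simulation. This is a sketch, not the authors' LLVM pass: the two kernels below model the original and the factor-2 coarsened version, and the input sizes are invented for illustration.

```python
# Sketch: simulate thread coarsening with factor 2 and check that the
# coarsened kernel computes the same result with half the threads.

def original_kernel(A, B, width, tid):
    # One original thread: sum a window of A and store it in B[tid].
    tmp = 0
    for index in range(width):
        tmp += A[tid + index]
    B[tid] = tmp

def coarsened_kernel(A, B, width, tid):
    # Divergent instructions replicated: coarsened thread tid covers
    # original thread ids 2*tid and 2*tid + 1.
    tmp1 = 0
    tmp2 = 0
    for index in range(width):
        tmp1 += A[2 * tid + index]
        tmp2 += A[2 * tid + 1 + index]
    B[2 * tid] = tmp1
    B[2 * tid + 1] = tmp2

width, n_threads = 4, 8                   # invented sizes
A = list(range(n_threads + width))
B_orig = [0] * n_threads
B_coar = [0] * n_threads

for tid in range(n_threads):              # 8 original threads
    original_kernel(A, B_orig, width, tid)
for tid in range(n_threads // 2):         # thread number halved at runtime
    coarsened_kernel(A, B_coar, width, tid)

assert B_orig == B_coar
```

The uniform loop structure stays shared between the two replicas; only the id-dependent (divergent) loads and stores are duplicated, which is where the redundancy reduction comes from.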
What's next Motivation Compiler Infrastructure Thread-Coarsening Experiments Data Analysis Conclusions and Future Work 14
Parameter Space. Static parameters: coarsening factor, stride, direction. Dynamic parameter: local work-group size. ~300 configurations for one-D benchmarks, ~2,000 configurations for two-D benchmarks.
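The configuration space is the cross product of these parameters. A minimal enumeration sketch follows; the concrete parameter values are assumptions for illustration, not the values explored in the paper.

```python
from itertools import product

# Assumed illustrative values (the slide does not list the actual ranges).
factors     = [1, 2, 4, 8, 16, 32]   # coarsening factor
strides     = [1, 2, 4, 8, 16, 32]   # stride between coarsened threads
directions  = ["x", "y"]             # coarsening direction (2D kernels only)
group_sizes = [32, 64, 128, 256]     # local work-group size (dynamic)

# 1D kernels have no direction choice; 2D kernels multiply the space by it.
configs_1d = list(product(factors, strides, group_sizes))
configs_2d = list(product(factors, strides, directions, group_sizes))
print(len(configs_1d), len(configs_2d))
```

The extra direction axis is why two-D benchmarks have a configuration space several times larger than one-D ones.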
Experimental Set-Up: 17 benchmarks from the Nvidia, AMD, and Parboil suites; 5 devices: Nvidia Fermi GTX480, Nvidia Kepler K20, AMD Cypress HD 5900, AMD Tahiti 7970, Intel Core i7; ~43,000 runs in total.
Performance Varies Significantly [Figure: distribution of coarsening speedups per benchmark on Nvidia Fermi and AMD Cypress]
What's next: Motivation, Compiler Infrastructure, Thread-Coarsening Experiments, Data Analysis, Conclusions and Future Work
Data Collection: full exploration of the configuration space, then selection of the configurations of interest, then profiling of the selected configurations.
Profiler Counters. Nvidia: #instructions, #branches, #loads, #stores. AMD: ALU utilization, vector utilization, VLIW packing, cache utilization, memory unit utilization; cache hit rates: L1, L2.
Counter Analysis. GOAL: discriminate fast and slow configurations. Each counter is expressed as a value relative to the uncoarsened baseline; a threshold x splits the configurations into those with counter > x and those with counter < x, and we compare the speedups of the two groups.
Explaining Performance per Device: for each device, run on the hardware to collect speedups and profiler counters, then fit a regression tree with splits of the form counter < value. The tree discriminates fast and slow configurations and relates counters to performance; trees are easy to read, with each leaf holding an expected speedup.
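The methodology can be sketched with a tiny greedy regression-tree learner. This is not the authors' implementation, and the counter values and speedups below are invented for illustration; it only shows how splits of the form counter < value are chosen to separate fast and slow configurations.

```python
# Sketch: fit a small regression tree predicting coarsening speedup
# from counter values normalized against the uncoarsened baseline.

def fit_tree(X, y, features, depth=0, max_depth=2, min_leaf=2):
    """Greedy regression tree: pick the (feature, threshold) split that
    minimizes the summed squared error of the two children."""
    mean = sum(y) / len(y)
    if depth == max_depth or len(y) <= min_leaf:
        return ("leaf", mean)
    best = None
    for f in features:
        for t in sorted({x[f] for x in X}):
            left  = [v for x, v in zip(X, y) if x[f] < t]
            right = [v for x, v in zip(X, y) if x[f] >= t]
            if not left or not right:
                continue
            sse = sum((v - sum(left) / len(left)) ** 2 for v in left) + \
                  sum((v - sum(right) / len(right)) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, f, t)
    if best is None:
        return ("leaf", mean)
    _, f, t = best
    L = [(x, v) for x, v in zip(X, y) if x[f] < t]
    R = [(x, v) for x, v in zip(X, y) if x[f] >= t]
    return ("split", f, t,
            fit_tree([x for x, _ in L], [v for _, v in L], features, depth + 1),
            fit_tree([x for x, _ in R], [v for _, v in R], features, depth + 1))

# Invented training data: relative counter values and observed speedups.
X = [{"loads": 0.80, "branches": 1.0}, {"loads": 0.90, "branches": 2.0},
     {"loads": 1.00, "branches": 3.0}, {"loads": 1.10, "branches": 3.5},
     {"loads": 0.85, "branches": 1.5}, {"loads": 1.20, "branches": 0.9}]
y = [1.7, 1.5, 0.6, 0.4, 1.6, 0.9]

tree = fit_tree(X, y, ["loads", "branches"])
print(tree)
```

On this toy data the root split lands on the loads counter: configurations whose coarsened version issues relatively fewer loads are the fast ones, mirroring the shape of the trees on the following slides.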
Tree Analysis, Nvidia Fermi: [Regression tree: Loads < 0.92 → 1.70; else Branches < 2.80 → (Cache Misses < 1.00 → 1.06; else 0.40); else 0.81]
Tree Analysis, Nvidia Fermi: floydwarshall, sgemm
Dynamic Counter [figure]
Tree Analysis, Nvidia Fermi: floydwarshall, sgemm, spmv, mvcoal
Dynamic Counter [figure]
Tree Analysis, AMD Cypress: spmv, stencil, nbody, BinarySearch, mt
Dynamic Counter [figure]
What's next: Motivation, Compiler Infrastructure, Thread-Coarsening Experiments, Data Analysis, Conclusions and Future Work
Conclusion and Future Work: an automatic methodology for performance explanation; a first step toward the definition of compiler heuristics and automatic coarsening tuning. [Figures: regression trees for Nvidia Fermi (splits on Loads, Branches, Cache Misses) and AMD Cypress (splits on ALUPacking, ALUBusy)]