A Large-Scale Cross-Architecture Evaluation of Thread-Coarsening
Alberto Magni, Christophe Dubach, Michael O'Boyle
Introduction: wide adoption of GPGPU for HPC; many GPU devices from many vendors: AMD, Nvidia, Intel, Qualcomm, ARM.
Introduction: OpenCL runs on devices from AMD, Nvidia, Intel, Qualcomm, and ARM.
OpenCL is Functionally Portable: each vendor (AMD, Nvidia, Intel, Qualcomm, ARM) ships a proprietary compiler for its own devices.
Performance Evaluation: regression trees for thread coarsening.
[Regression tree: Loads < 0.92 → 1.70; else Branches < 2.80 → (Cache Misses < 1.00 → 1.06; else 0.40); else 0.81]
Performance is NOT Portable [Figure: coarsening speedups of Nbody on AMD Cypress vs. Nvidia Fermi]
What's next: Motivation, Compiler Infrastructure, Thread-Coarsening Experiments, Data Analysis, Conclusions and Future Work
Thread Coarsening: merge threads, reducing the thread count (original thread space → transformed thread space) while increasing the amount of work per thread.
Advantages of Thread Coarsening: reduces the amount of redundant computation; supported by the standard and architecture-independent, which makes it well suited to a cross-architectural evaluation.
Our Portable Compiler
Thread Coarsening Implementation: an LLVM function pass replicates the instructions in the kernel body. Original kernel, executed here by threads id = 0 and id = 1:

  for index in (0 : width)
    tmp += A[id + index]
  B[id] = tmp

Thread Coarsening Implementation: identify the divergent instructions, i.e. those whose result depends on the thread id (here the accesses to A and B).

Thread Coarsening Implementation: replicate the divergent instructions. With a coarsening factor of 2, each remaining thread does the work of two original threads:

  for index in (0 : width)
    tmp1 += A[2*id + index]
    tmp2 += A[2*id + 1 + index]
  B[2*id] = tmp1
  B[2*id + 1] = tmp2

The thread number is reduced at runtime.
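The transformation above can be sanity-checked with a small simulation. This is a sketch, not the authors' LLVM pass: the two kernels below model the original and the factor-2 coarsened version, and the input sizes are invented for illustration.

```python
# Sketch: simulate thread coarsening with factor 2 and check that the
# coarsened kernel computes the same result with half the threads.

def original_kernel(A, B, width, tid):
    # One original thread: sum a window of A and store it in B[tid].
    tmp = 0
    for index in range(width):
        tmp += A[tid + index]
    B[tid] = tmp

def coarsened_kernel(A, B, width, tid):
    # Divergent instructions replicated: coarsened thread tid covers
    # original thread ids 2*tid and 2*tid + 1.
    tmp1 = 0
    tmp2 = 0
    for index in range(width):
        tmp1 += A[2 * tid + index]
        tmp2 += A[2 * tid + 1 + index]
    B[2 * tid] = tmp1
    B[2 * tid + 1] = tmp2

width, n_threads = 4, 8                   # invented sizes
A = list(range(n_threads + width))
B_orig = [0] * n_threads
B_coar = [0] * n_threads

for tid in range(n_threads):              # 8 original threads
    original_kernel(A, B_orig, width, tid)
for tid in range(n_threads // 2):         # thread number halved at runtime
    coarsened_kernel(A, B_coar, width, tid)

assert B_orig == B_coar
```

The uniform loop structure stays shared between the two replicas; only the id-dependent (divergent) loads and stores are duplicated, which is where the redundancy reduction comes from.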
What's next Motivation Compiler Infrastructure Thread-Coarsening Experiments Data Analysis Conclusions and Future Work 14
Parameter Space. Static parameters: coarsening factor, stride, direction. Dynamic parameter: local work-group size. ~300 configurations for one-D benchmarks, ~2,000 configurations for two-D benchmarks.
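The configuration space is the cross product of these parameters. A minimal enumeration sketch follows; the concrete parameter values are assumptions for illustration, not the values explored in the paper.

```python
from itertools import product

# Assumed illustrative values (the slide does not list the actual ranges).
factors     = [1, 2, 4, 8, 16, 32]   # coarsening factor
strides     = [1, 2, 4, 8, 16, 32]   # stride between coarsened threads
directions  = ["x", "y"]             # coarsening direction (2D kernels only)
group_sizes = [32, 64, 128, 256]     # local work-group size (dynamic)

# 1D kernels have no direction choice; 2D kernels multiply the space by it.
configs_1d = list(product(factors, strides, group_sizes))
configs_2d = list(product(factors, strides, directions, group_sizes))
print(len(configs_1d), len(configs_2d))
```

The extra direction axis is why two-D benchmarks have a configuration space several times larger than one-D ones.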
Experimental Set-Up: 17 benchmarks from the Nvidia, AMD, and Parboil suites; 5 devices: Nvidia Fermi GTX480, Nvidia Kepler K20, AMD Cypress HD 5900, AMD Tahiti 7970, Intel Core i7; ~43,000 runs in total.
Performance Varies Significantly [Figure: distribution of coarsening speedups per benchmark on Nvidia Fermi and AMD Cypress]
What's next: Motivation, Compiler Infrastructure, Thread-Coarsening Experiments, Data Analysis, Conclusions and Future Work
Data Collection: full exploration of the configuration space, then selection of the configurations of interest, then profiling of the selected configurations.
Profiler Counters. Nvidia: #instructions, #branches, #loads, #stores. AMD: ALU utilization, vector utilization, VLIW packing, cache utilization, memory unit utilization; cache hit rates: L1, L2.
Counter Analysis. GOAL: discriminate fast and slow configurations. Each counter is expressed as a value relative to the uncoarsened baseline; a threshold x splits the configurations into those with counter > x and those with counter < x, and we compare the speedups of the two groups.
Explaining Performance per Device: for each device, run on the hardware to collect speedups and profiler counters, then fit a regression tree with splits of the form counter < value. The tree discriminates fast and slow configurations and relates counters to performance; trees are easy to read, with each leaf holding an expected speedup.
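The methodology can be sketched with a tiny greedy regression-tree learner. This is not the authors' implementation, and the counter values and speedups below are invented for illustration; it only shows how splits of the form counter < value are chosen to separate fast and slow configurations.

```python
# Sketch: fit a small regression tree predicting coarsening speedup
# from counter values normalized against the uncoarsened baseline.

def fit_tree(X, y, features, depth=0, max_depth=2, min_leaf=2):
    """Greedy regression tree: pick the (feature, threshold) split that
    minimizes the summed squared error of the two children."""
    mean = sum(y) / len(y)
    if depth == max_depth or len(y) <= min_leaf:
        return ("leaf", mean)
    best = None
    for f in features:
        for t in sorted({x[f] for x in X}):
            left  = [v for x, v in zip(X, y) if x[f] < t]
            right = [v for x, v in zip(X, y) if x[f] >= t]
            if not left or not right:
                continue
            sse = sum((v - sum(left) / len(left)) ** 2 for v in left) + \
                  sum((v - sum(right) / len(right)) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, f, t)
    if best is None:
        return ("leaf", mean)
    _, f, t = best
    L = [(x, v) for x, v in zip(X, y) if x[f] < t]
    R = [(x, v) for x, v in zip(X, y) if x[f] >= t]
    return ("split", f, t,
            fit_tree([x for x, _ in L], [v for _, v in L], features, depth + 1),
            fit_tree([x for x, _ in R], [v for _, v in R], features, depth + 1))

# Invented training data: relative counter values and observed speedups.
X = [{"loads": 0.80, "branches": 1.0}, {"loads": 0.90, "branches": 2.0},
     {"loads": 1.00, "branches": 3.0}, {"loads": 1.10, "branches": 3.5},
     {"loads": 0.85, "branches": 1.5}, {"loads": 1.20, "branches": 0.9}]
y = [1.7, 1.5, 0.6, 0.4, 1.6, 0.9]

tree = fit_tree(X, y, ["loads", "branches"])
print(tree)
```

On this toy data the root split lands on the loads counter: configurations whose coarsened version issues relatively fewer loads are the fast ones, mirroring the shape of the trees on the following slides.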
Tree Analysis, Nvidia Fermi: [Regression tree: Loads < 0.92 → 1.70; else Branches < 2.80 → (Cache Misses < 1.00 → 1.06; else 0.40); else 0.81]
Tree Analysis, Nvidia Fermi: floydwarshall, sgemm
Dynamic Counter [figure]
Tree Analysis, Nvidia Fermi: floydwarshall, sgemm, spmv, mvcoal
Dynamic Counter [figure]
Tree Analysis, AMD Cypress: spmv, stencil, nbody, BinarySearch, mt
Dynamic Counter [figure]
What's next: Motivation, Compiler Infrastructure, Thread-Coarsening Experiments, Data Analysis, Conclusions and Future Work
Conclusion and Future Work: an automatic methodology for performance explanation; a first step toward the definition of compiler heuristics and automatic coarsening tuning. [Figures: regression trees for Nvidia Fermi (splits on Loads, Branches, Cache Misses) and AMD Cypress (splits on ALUPacking, ALUBusy)]