A Large-Scale Cross-Architecture Evaluation of Thread-Coarsening. Alberto Magni, Christophe Dubach, Michael O'Boyle

1 A Large-Scale Cross-Architecture Evaluation of Thread-Coarsening Alberto Magni, Christophe Dubach, Michael O'Boyle

2 Introduction: Wide adoption of GPGPU for HPC. Many GPU devices from many vendors: AMD, Nvidia, Intel, Qualcomm, ARM.

3 Introduction: OpenCL (AMD, Nvidia, Intel, Qualcomm, ARM)

4 OpenCL is Functionally Portable: OpenCL, proprietary compiler (AMD, Nvidia, Intel, Qualcomm, ARM)

5 Performance Evaluation, Regression Trees, Thread Coarsening [figure: regression tree; splits: Loads < 0.92, Branches < 2.80, Cache Misses < 1.00; leaf values: 1.70, 1.06, 0.40]

6 Performance is NOT Portable [figure: Nbody on AMD Cypress and Nvidia Fermi]

7 What's next: Motivation; Compiler Infrastructure; Thread-Coarsening; Experiments; Data Analysis; Conclusions and Future Work

8 Thread Coarsening: from the Original Thread Space to the Transformed Thread Space. Reduce the number of threads; increase the amount of work per thread.

9 Advantages of Thread Coarsening: Reduces the amount of redundant computation. Perfect for a cross-architectural evaluation: supported by the standard, architecture independent.

10 Our Portable Compiler

11 Thread Coarsening Implementation: An LLVM function pass replicates the instructions in the kernel body. Thread id = 0: for index in (0 : width) tmp += A[id + index]; B[id] = tmp; Thread id = 1: for index in (0 : width) tmp += A[id + index]; B[id] = tmp;

12 Thread Coarsening Implementation: Identify divergent instructions. Thread id = 0: for index in (0 : width) tmp += A[id + index]; B[id] = tmp; Thread id = 1: for index in (0 : width) tmp += A[id + index]; B[id] = tmp;

13 Thread Coarsening Implementation: Replicate divergent instructions. Thread id = 0: for index in (0 : width) tmp1 += A[id + index]; tmp2 += A[2*id + 1 + index]; B[id] = tmp1; B[2*id + 1] = tmp2; The number of threads is reduced at runtime.
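The transformation on slides 11 to 13 can be sketched as a small Python simulation, where each loop iteration over `tid` stands in for one GPU thread. This is an illustrative sketch, not the LLVM pass itself. It assumes a coarsening factor of 2 and assumes that coarsened thread `tid` takes over the work of original threads `2*tid` and `2*tid + 1`; the function names are hypothetical.

```python
def original_kernel(A, width, n_threads):
    """Uncoarsened kernel: one logical thread per output element."""
    B = [0] * n_threads
    for tid in range(n_threads):  # each iteration models one thread
        tmp = 0
        for index in range(width):
            tmp += A[tid + index]
        B[tid] = tmp
    return B


def coarsened_kernel(A, width, n_threads):
    """Coarsening factor 2 (assumed mapping): half the threads, two outputs each."""
    assert n_threads % 2 == 0, "sketch assumes an even number of original threads"
    B = [0] * n_threads
    for tid in range(n_threads // 2):  # thread count is halved
        tmp1 = 0
        tmp2 = 0
        for index in range(width):  # loop control is uniform, so it is not replicated
            tmp1 += A[2 * tid + index]      # work of original thread 2*tid
            tmp2 += A[2 * tid + 1 + index]  # work of original thread 2*tid + 1
        B[2 * tid] = tmp1
        B[2 * tid + 1] = tmp2
    return B


# Hypothetical input: 8 original threads, window of 4 elements.
A = list(range(12))
same = original_kernel(A, 4, 8) == coarsened_kernel(A, 4, 8)
```

Only the instructions that depend on the thread id (the loads, the accumulators and the stores) are duplicated; the loop itself is shared, which is where the reduction in redundant work comes from.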

15 Parameter Space: Static Parameters: Coarsening Factor, Stride, Direction. Dynamic Parameters: Local Work Group Size. ~300 configs for 1D benchmarks; ~2,000 configs for 2D benchmarks.
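How the coarsening factor and stride might interact can be sketched as an index mapping from each coarsened thread to the original thread ids it replaces. The formula below is an assumption for illustration; the slides do not state the mapping. With stride 1 a thread takes consecutive original ids, while a larger stride interleaves them so that neighbouring threads still touch neighbouring elements.

```python
def coarsened_ids(tid, factor, stride):
    """Original thread ids merged into coarsened thread `tid` (assumed mapping)."""
    base = (tid // stride) * (factor * stride) + tid % stride
    return [base + k * stride for k in range(factor)]


def covers_exactly(n_original, factor, stride):
    """Check that every original id is assigned to exactly one coarsened thread."""
    seen = []
    for tid in range(n_original // factor):
        seen.extend(coarsened_ids(tid, factor, stride))
    return sorted(seen) == list(range(n_original))


# Factor 2, stride 1: thread 0 handles ids 0 and 1, thread 1 handles ids 2 and 3.
consecutive = [coarsened_ids(t, 2, 1) for t in range(2)]
# Factor 2, stride 2: thread 0 handles ids 0 and 2, thread 1 handles ids 1 and 3.
interleaved = [coarsened_ids(t, 2, 2) for t in range(2)]
```

The check in `covers_exactly` holds when the number of coarsened threads is a multiple of the stride, which is the case this sketch is limited to.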

16 Experimental Set-Up: 17 benchmarks from Nvidia / AMD / Parboil. 5 Devices: Nvidia Fermi GTX480, Nvidia Kepler K20, AMD Cypress HD 5900, AMD Tahiti 7970, Intel Core-i7. ~43,000 runs in total.

18 Performance Varies Significantly [figure: Nvidia Fermi and AMD Cypress]

19 Performance Varies Significantly [figure: Nvidia Fermi]

20 Performance Varies Significantly [figure: AMD Cypress]

22 Data Collection: Full Exploration

23 Data Collection: Config Selection

24 Data Collection: Profiling

26 Profiler Counters: Nvidia: #instructions, #branches, #loads, #stores. AMD: ALU Utilization, Vector Utilization, VLIW Packing, Cache Utilization, Memory Unit Utilization. Cache: L1 HitRate, L2 HitRate.

27 Counters Analysis. GOAL: discriminate fast and slow configs. [figure: speedup against a counter's relative value, split at a threshold x into counter < x and counter > x]
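The two quantities compared on slide 27 can be sketched as follows. The sketch assumes that both are measured relative to the uncoarsened baseline run of the same benchmark, which the slide does not spell out; the function names and all numbers are hypothetical.

```python
def speedup(baseline_time, config_time):
    """Speedup of a coarsened configuration over the baseline (above 1 is faster)."""
    return baseline_time / config_time


def relative_counters(baseline, config):
    """Each profiler counter of a configuration, divided by its baseline value."""
    return {name: config[name] / baseline[name] for name in baseline}


# Hypothetical profiler readings, for illustration only.
baseline = {"loads": 1000, "branches": 200}
coarsened = {"loads": 500, "branches": 300}

rel = relative_counters(baseline, coarsened)  # loads halve, branches grow by 1.5x
s = speedup(2.0, 1.0)                         # run time halves
```

Normalising against the baseline lets configurations from different benchmarks be compared on the same scale, so a single threshold such as `counter < x` is meaningful across programs.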

28 Explaining Performance per Device: for each device, HW counters and speedups feed a regression tree (counter < value ...). Discriminate fast and slow configs. Relate counters to performance. Trees are easy to read.
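A regression tree of the kind described on slide 28 can be sketched in a few lines. This is a generic minimal tree that picks `counter < threshold` splits to minimise squared error; it is not the authors' tooling, and the training rows are made-up values chosen only to exercise the code.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared errors around the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, targets):
    """Return (error, counter, threshold) for the best 'counter < threshold' split."""
    best = None
    for counter in rows[0]:
        values = sorted(set(r[counter] for r in rows))
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [t for r, t in zip(rows, targets) if r[counter] < threshold]
            right = [t for r, t in zip(rows, targets) if r[counter] >= threshold]
            error = sse(left) + sse(right)
            if best is None or error < best[0]:
                best = (error, counter, threshold)
    return best  # None when no counter has two distinct values


def build_tree(rows, targets, depth=2):
    split = best_split(rows, targets) if depth > 0 and len(rows) > 1 else None
    if split is None:
        return mean(targets)  # leaf: mean speedup of the configs that reach it
    _, counter, threshold = split
    pairs = list(zip(rows, targets))
    left = [(r, t) for r, t in pairs if r[counter] < threshold]
    right = [(r, t) for r, t in pairs if r[counter] >= threshold]
    return {
        "feature": counter,
        "threshold": threshold,
        "true": build_tree([r for r, _ in left], [t for _, t in left], depth - 1),
        "false": build_tree([r for r, _ in right], [t for _, t in right], depth - 1),
    }


def predict(tree, row):
    while isinstance(tree, dict):
        branch = "true" if row[tree["feature"]] < tree["threshold"] else "false"
        tree = tree[branch]
    return tree


# Hypothetical relative counter values and speedups (not measured data).
rows = [
    {"loads": 0.5, "branches": 1.0},
    {"loads": 0.6, "branches": 1.2},
    {"loads": 1.0, "branches": 1.0},
    {"loads": 1.0, "branches": 3.0},
]
speedups = [1.8, 1.6, 1.0, 0.5]
tree = build_tree(rows, speedups)
```

Each internal node reads as a rule on one counter and each leaf as an expected speedup, which is what makes such trees easy to inspect by hand.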

29 Tree Analysis, Nvidia Fermi [figure: regression tree; splits: Loads < 0.92, Branches < 2.80, Cache Misses < 1.00; leaf values: 1.70, 1.06, 0.40]

30 Tree Analysis, Nvidia Fermi: floydwarshall, sgemm

31 Dynamic Counter

32 Tree Analysis, Nvidia Fermi: floydwarshall, sgemm, spmv, mvcoal

33 Dynamic Counter

34 Tree Analysis, AMD Cypress: spmv, stencil, nbody, BinarySearch, mt

35 Dynamic Counter

37 Conclusion and Future Work: Automatic methodology for performance explanation. First step toward the definition of compiler heuristics and automatic coarsening tuning. [figure: regression trees; splits: Loads < 0.92, Branches < 2.80, Cache Misses < 1.00, ALUPacking < 1.28, ALUBusy < 0.59]
