Algorithms for Auto-tuning OpenACC Accelerated Kernels

1 Algorithms for Auto-tuning OpenACC Accelerated Kernels. Fatemah Al-Zayer 1, Ameerah Al-Mutiry 1, Mona Al-Shahrani 1, Saber Feki 2, and David Keyes 1. 1 Extreme Computing Research Center, 2 KAUST Supercomputing Laboratory, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia. GTC 2016, San Jose, CA, April 6th, 2016. GPU Technology Conference 2016

2 Outline: Motivation; Auto-Tuning Methodology; Performance Results; Toward a Vector Model; Preliminary Results; Conclusion and Future Work

3 Motivation: Tuning loop execution is one important step in optimizing the performance of OpenACC accelerated code. The tunable parameters include the gang (num_gangs) and vector (vector_length) clauses, tile, and collapse. Manually tuning these parameters can be tedious and time consuming, which motivates auto-tuning.

4 Tuning Time! Brute-force search is very time consuming. Alternatives: a historic learning approach; faster derivative-free search algorithms (random search, simulated annealing, genetic algorithm, Nelder-Mead); a hybrid solution that combines search algorithms with historic learning.
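
As a rough illustration of the simplest derivative-free option listed above, the following is a minimal sketch of random search over (gang, vector) pairs. The function `synthetic_runtime` is a hypothetical stand-in for timing a real kernel launch and is not measured data; the gang grid assumes that "increments of 2" means the even values 2, 4, ..., 1024.

```python
import random

# Assumed search space: vector lengths are multiples of the warp size (32),
# gang counts are the even values from 2 to 1024 (an interpretation).
VECTORS = list(range(32, 1025, 32))
GANGS = list(range(2, 1025, 2))


def synthetic_runtime(gang, vector):
    """Hypothetical cost function standing in for one timed kernel run."""
    return abs(gang - 256) / 256 + abs(vector - 128) / 128 + 1.0


def random_search(measure, n_trials, seed=0):
    """Sample n_trials random (gang, vector) pairs and keep the fastest."""
    rng = random.Random(seed)
    best_cfg, best_time = None, float("inf")
    for _ in range(n_trials):
        cfg = (rng.choice(GANGS), rng.choice(VECTORS))
        t = measure(*cfg)
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time


best_cfg, best_time = random_search(synthetic_runtime, n_trials=200)
print(best_cfg, best_time)
```

In a real tuner, `measure` would rebuild or relaunch the kernel with the given clause values and return the elapsed time; the fixed seed only makes the sketch reproducible.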

5 Automatic Performance Tuning Method

6 Seismic Imaging Kernel: Solves the acoustic wave equation (isotropic case) with a finite-difference scheme, 2nd order in time and 4th or 8th order in space.
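
To make the scheme concrete, here is a minimal one-dimensional sketch of a leapfrog update that is 2nd order in time and 4th order in space. The kernel in the talk is three-dimensional and its source is not shown here, so the function names, the 1D reduction, and the untouched boundary points are all simplifying assumptions.

```python
def second_derivative_4th(u, dx):
    """4th-order central approximation of d2u/dx2 at interior points.

    The two points at each boundary are left at 0.0 in this sketch.
    """
    c0, c1, c2 = -5.0 / 2.0, 4.0 / 3.0, -1.0 / 12.0
    out = [0.0] * len(u)
    for i in range(2, len(u) - 2):
        out[i] = (
            c0 * u[i]
            + c1 * (u[i - 1] + u[i + 1])
            + c2 * (u[i - 2] + u[i + 2])
        ) / (dx * dx)
    return out


def leapfrog_step(u_prev, u_curr, velocity, dt, dx):
    """2nd-order-in-time update: u_next = 2*u - u_prev + (c*dt)^2 * u_xx."""
    lap = second_derivative_4th(u_curr, dx)
    return [
        2.0 * u_curr[i] - u_prev[i] + (velocity * dt) ** 2 * lap[i]
        for i in range(len(u_curr))
    ]


# Sanity check on u(x) = x^2, whose second derivative is exactly 2.
u = [float(x * x) for x in range(10)]
lap = second_derivative_4th(u, 1.0)
u_next = leapfrog_step(u, u, velocity=1.0, dt=0.1, dx=1.0)
print(lap[5], u_next[5])
```

The inner loop over `i` is the part that an OpenACC version would offload, which is why the gang and vector values applied to it matter for performance.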

7 Seismic Imaging Kernel: CPU version

8 Seismic Imaging Kernel: OpenACC Implementation

9 Experimental Results: Seismic kernel solving the acoustic wave equation with a finite-difference scheme, 8th and 4th order in space. Performance is reported on NVIDIA K20 and K40 GPUs. NVIDIA recommends that the vector size be a multiple of the warp size (32), so we explored the values [32, 64, 96, ..., 1024]. Gang values were tested in increments of 2, starting from 2 up to 1024. We report performance and tuning time for the different search algorithms, alone and in combination with historic learning.
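
The size of the search space described above can be worked out directly. This short sketch assumes that "increments of 2" means the even gang values 2, 4, ..., 1024; under that reading a brute-force search has to time every pair in the grid.

```python
WARP_SIZE = 32

# Vector lengths: multiples of the warp size up to 1024.
vector_values = list(range(WARP_SIZE, 1024 + 1, WARP_SIZE))

# Gang counts: even values from 2 to 1024 (assumed interpretation).
gang_values = list(range(2, 1024 + 1, 2))

n_configs = len(vector_values) * len(gang_values)
print(len(vector_values), len(gang_values), n_configs)  # 32 512 16384
```

With 16384 configurations per problem size under this assumption, exhaustive timing is expensive, which is the cost the faster search algorithms aim to avoid.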

10 K20, 8th order: Speedup (chart: speedup vs. problem size; series: Brute Force, Random Walk, Simulated Annealing, Nelder-Mead, Genetic Algorithm 1, Genetic Algorithm 2)

11 K20, 8th order: Tuning time (chart: time in seconds vs. problem size; series: Brute Force, Random Walk, Simulated Annealing, Nelder-Mead, Genetic Algorithm 1, Genetic Algorithm 2)

12 K20, 4th order: Speedup (chart: speedup vs. problem size; series: Brute Force, Random Walk, Simulated Annealing, Nelder-Mead, Genetic Algorithm 1, Genetic Algorithm 2)

13 K20, 4th order: Tuning time (chart: time in seconds vs. problem size; series: Brute Force, Random Walk, Simulated Annealing, Nelder-Mead, Genetic Algorithm 1, Genetic Algorithm 2)

14 K40, 8th order: Speedup (chart: speedup vs. problem size; series: Brute Force, Random Walk, Simulated Annealing, Nelder-Mead, Genetic Algorithm 1, Genetic Algorithm 2)

15 K40, 8th order: Tuning time (chart: time in seconds vs. problem size; series: Brute Force, Random Walk, Simulated Annealing, Nelder-Mead, Genetic Algorithm 1, Genetic Algorithm 2)

16 K20, 8th order: Speedup with historic learning (chart: speedup vs. problem size; series: Brute Force, Historic Learning and Brute Force, Historic Learning and Random Walk, Historic Learning and Nelder-Mead, Historic Learning and Genetic Algorithm)

17 K20, 8th order: Tuning time with historic learning (chart: time in seconds vs. problem size; series: Brute Force, Historic Learning and Brute Force, Historic Learning and Random Walk, Historic Learning and Nelder-Mead, Historic Learning and Genetic Algorithm)
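
The slides do not spell out how historic learning is implemented. One plausible reading, sketched below purely as an assumption, is to reuse the best configuration found for the closest previously tuned problem size as the starting point of a new search. The `history` entries are made-up placeholders, not results from the talk.

```python
def nearest_tuned(history, size):
    """Return the stored (gang, vector) of the closest tuned problem size.

    history maps (nx, ny, nz) tuples to the best (gang, vector) found
    earlier; distance is the squared Euclidean distance between sizes.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    closest = min(history, key=lambda s: dist(s, size))
    return history[closest]


# Placeholder history (hypothetical values for illustration only).
history = {
    (128, 128, 128): (256, 128),
    (512, 512, 512): (512, 64),
}

start = nearest_tuned(history, (140, 120, 130))
print(start)  # (256, 128)
```

A hybrid tuner in this spirit would hand `start` to a local search such as Nelder-Mead, so that only a small neighborhood needs to be timed instead of the whole grid.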

18 Model for gang and vector? Can we provide a better model for the compiler to use when selecting the best gang and/or vector values? For a given: three-dimensional problem size; application (e.g. 8th order vs. 4th order); GPU specification: K20 (K40 results: work in progress). We look for correlations between the problem size and the best values of the gang and vector parameters. Which dimensions?

19 Vector as a function of Z dimension: 8th order on K (chart: best vector value vs. Z dimension)

20 Vector models as a function of Z dimension (two charts: vector value vs. Z dimension; first panel series: BF Vector, M1 Vector, M2 Vector; second panel series: BF Vector, M3 Vector, M4 Vector)

21 Performance Speedup (I): The best value is the average speedup over all problem sizes obtained with the auto-tuned pair (g_AT, v_AT), compared with the compiler's choice (g_c, v_c). We also report the average speedup when using the model value v_M for vector together with: the computed value of gang, (g_c * v_c / v_M, v_M); the auto-tuned value of gang, (g_AT, v_M); the compiler value of gang, (g_c, v_M).
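
The "computed value of gang" keeps the product of gang and vector equal to the compiler's product while swapping in the model's vector length. A minimal sketch of that arithmetic follows; the integer rounding and the lower bound of one gang are assumptions, since the slide gives only the formula g_c * v_c / v_M.

```python
def computed_gang(g_c, v_c, v_m):
    """Rescale the compiler's gang count for a model-chosen vector length.

    Implements g_c * v_c / v_M with floor division, clamped to at least 1
    (rounding and clamping are assumptions of this sketch).
    """
    return max(1, (g_c * v_c) // v_m)


# Example: compiler picks 512 gangs of 128 threads, model suggests 256.
print(computed_gang(512, 128, 256))  # 256
```

The resulting pair `(computed_gang(g_c, v_c, v_m), v_m)` launches roughly the same total number of threads as the compiler's default, so any speedup comes from how the work is shaped rather than from its amount.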

22 Performance Speedup (II) (two bar charts of performance speedup in percent: Model 1, small/medium Z dimension; Model 3 and Model 4, larger Z dimension; bars: auto-tuned values of gang and vector, computed value of gang, auto-tuned value of gang, compiler value of gang)

23 Performance Speedup (III): 4th order kernel with 8th order kernel Model 1 for all Z-dimension values (bar chart: Model 1 performance speedup in percent; bars: auto-tuned values of gang and vector, auto-tuned value of gang, compiler value of gang)

24 Conclusions: Auto-tuning the gang and vector parameters with different algorithms boosts the performance of OpenACC accelerated kernels. A hybrid of historic learning and Nelder-Mead delivers the best balance of high performance and low tuning effort. We suggested a vector model that can complement the compiler's gang choice to reach near auto-tuned performance.

25 Future Work: Validate the vector model with different OpenACC kernels and correlate it with the kernels' profiles. Extrapolate the vector model across architectures: different NVIDIA GPU generations and other accelerators. Target other parameters and OpenACC 2.5 features such as the tile clause.

26 Thank You!

27 Thanks!
