Achieving Portable Performance for GTC-P with OpenACC on GPU, multi-core CPU, and Sunway Many-core Processor

1 Achieving Portable Performance for GTC-P with OpenACC on GPU, multi-core CPU, and Sunway Many-core Processor
Stephen Wang 1, James Lin 1,4, William Tang 2, Stephane Ethier 2, Bei Wang 2, Simon See 1,3 and Satoshi Matsuoka 4
GTC 2017, San Jose, USA, May 11, 2017
1 Shanghai Jiao Tong University, Center for HPC
2 Princeton University, Institute for Computational Science & Engineering (PICSciE) and Plasma Physics Laboratory (PPPL)
3 NVIDIA Corporation
4 Tokyo Institute of Technology

2 Challenges of supporting multi- and many-cores, the territory of OpenMP
[Figure; axis label: Core Number]

3 GTC-P: Gyrokinetic Toroidal Code - Princeton
Developed by Princeton to accelerate progress in highly scalable plasma turbulence HPC Particle-in-Cell (PIC) codes.
Successfully applied to high-resolution problem-size-scaling studies relevant to fusion's next-generation International Thermonuclear Experimental Reactor (ITER).
Modern co-design version of the comprehensive original GTC code, with a focus on using computer science performance modeling to improve basic PIC operations and deliver simulations at extreme scales with unprecedented resolution and speed on a variety of architectures worldwide.
These include present-day multi-petaflop supercomputers such as Tianhe-2, Titan, Sequoia, and Mira, which feature GPU, multicore CPU, and manycore processors.
KEY REFERENCE: W. Tang, B. Wang, S. Ethier, G. Kwasniewski, T. Hoefler, et al., "Extreme Scale Plasma Turbulence Simulations on Top Supercomputers Worldwide," Supercomputing (SC) 2016 Conference, Salt Lake City, Utah, USA.

4 OpenACC Implementations
Challenges
a. Memory-bound kernels
b. Data hazard
c. Random memory access hotspots
Implementations
a. Increase memory bandwidth
b. Use atomic operations
c. Take advantage of local memory
[Figure: Six Major Subroutines of GTC-P]

5 OpenACC Implementations: present directive
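The code shown on this slide is not preserved in the transcript. As a minimal sketch of how the present clause is commonly used, assume a hypothetical array `field` that is copied to the device once by an enclosing data region; the kernel then declares it present so that no transfer happens per call. The names `scale_field` and `run_steps` are illustrative and are not taken from GTC-P. Without an OpenACC compiler the pragmas are ignored and the code runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel: the present clause states that `field` already resides on the
 * device, so the compiler generates no host/device transfer here. */
static void scale_field(double *field, size_t n, double factor)
{
    #pragma acc parallel loop present(field[0:n])
    for (size_t i = 0; i < n; ++i) {
        field[i] *= factor;
    }
}

/* Driver: a single data region surrounds all kernel calls, so the array
 * moves to the device once and back once, not once per step. */
void run_steps(double *field, size_t n, int steps, double factor)
{
    assert(field != NULL);

    #pragma acc data copy(field[0:n])
    {
        for (int s = 0; s < steps; ++s) {
            scale_field(field, n, factor);
        }
    }
}
```

For memory-bound kernels such as those listed earlier, keeping data resident across calls is one way to avoid spending bandwidth on repeated transfers.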

6 OpenACC Implementations: atomic directive
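The slide's code is likewise not preserved. The sketch below illustrates the general pattern the atomic directive addresses: a scatter-add in which several particles may update the same grid cell. It is a simplified stand-in, with one deposit per particle and hypothetical names (`deposit_charge`, `cell`, `weight`), not the GTC-P charge kernel itself.

```c
#include <assert.h>
#include <stddef.h>

/* Scatter-add: each particle adds its weight to one grid cell. Different
 * particles can map to the same cell, which is a data hazard when the loop
 * runs in parallel, so the update is marked atomic. */
void deposit_charge(double *grid, size_t ncells,
                    const int *cell, const double *weight,
                    size_t nparticles)
{
    assert(grid != NULL && cell != NULL && weight != NULL);

    #pragma acc parallel loop copy(grid[0:ncells]) \
        copyin(cell[0:nparticles], weight[0:nparticles])
    for (size_t p = 0; p < nparticles; ++p) {
        #pragma acc atomic update
        grid[cell[p]] += weight[p];
    }
}
```

The cost of this pattern depends on how well the target hardware supports floating-point atomic updates.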

7 Run the single OpenACC code base: huge performance gap on x86 and Sunway
GPU (NVIDIA K20): baseline CUDA vs. OpenACC, elapsed time (s). [Table values missing]
x86 multicore (Intel SNB): baseline OpenMP vs. OpenACC, elapsed time (s). [Table values missing] OpenACC is slower: the OpenMP version allocates a copy of the array on each thread and then reduces, without atomic operations.
Sunway many-core (SW26010): baseline serial code on 1 MPE vs. OpenACC code on 64 CPEs, elapsed time (s). [Table values missing] The OpenACC slowdown is unacceptable.

8 Our solution for multi- and many-core: use the thread ID to give each thread its own copy of the array, then reduce, replacing the fetch-and-add atomic operation
Data hazard: threads T1 to T4 perform irregular memory accesses (fetch-and-add) on the shared array[n].
Solution: each thread writes only to its own copy, array[thread-id][n] (one copy each for T1, T2, T3, and T4); the copies are then combined by a reduction (add).
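The replicate-then-reduce idea can be sketched in plain C. This is an illustration under stated assumptions, not the authors' code: the outer loop over `tid` is a serial stand-in for the parallel threads, the function name `deposit_replicated` is hypothetical, and particles are split into contiguous blocks per thread purely for simplicity.

```c
#include <assert.h>
#include <stdlib.h>

/* Deposit without atomics: thread `tid` owns row `tid` of a private
 * [nthreads][ncells] buffer; the rows are then summed into `grid`.
 * Returns 0 on success, -1 if the buffer cannot be allocated. */
int deposit_replicated(double *grid, size_t ncells,
                       const int *cell, const double *weight,
                       size_t nparticles, size_t nthreads)
{
    assert(nthreads > 0);

    double *copies = calloc(nthreads * ncells, sizeof *copies);
    if (copies == NULL) {
        return -1;
    }

    /* Phase 1: each "thread" processes its block of particles and writes
     * only to its own row, so no two threads touch the same address. */
    for (size_t tid = 0; tid < nthreads; ++tid) {
        size_t begin = tid * nparticles / nthreads;
        size_t end = (tid + 1) * nparticles / nthreads;
        for (size_t p = begin; p < end; ++p) {
            copies[tid * ncells + (size_t)cell[p]] += weight[p];
        }
    }

    /* Phase 2: reduction (add). Each cell is written by exactly one
     * iteration, so this loop is also free of data hazards. */
    for (size_t c = 0; c < ncells; ++c) {
        double sum = 0.0;
        for (size_t tid = 0; tid < nthreads; ++tid) {
            sum += copies[tid * ncells + c];
        }
        grid[c] += sum;
    }

    free(copies);
    return 0;
}
```

The trade-off is memory: the buffer grows with the number of threads, in exchange for removing the atomic updates.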

9 Performance w/o atomic operations on x86 CPU
The thread ID is not yet supported for x86 in the OpenACC standard. A private function of the PGI compiler, pgi_blockidx(), is used here instead.
[Figure: performance chart; labels: Baseline, PGI compiler]

10 Implementation on Sunway many-core processor: a customized thread-id extension available from Sunway OpenACC
acc_thread_id is a customized extension provided in Sunway OpenACC.
[Figure: Architecture overview of SW26010]

11 Optimization on Sunway many-core processor: data locality in 64 KB Scratch Pad Memory
Use the tile directive to coalesce data accesses into each DMA request.
The optimum tile size makes full use of the 64 KB SPM.
Keep data in the SPM instead of accessing global memory.
[Figures: elapsed time (s) versus tile_size, lower is better; memory hierarchy of a CPE, including the SPM]
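The relationship between tile size and the 64 KB SPM can be made concrete with a small calculation and a tiled loop. This is a sketch under assumptions: `max_tile_elems` and `scale_tiled` are hypothetical helpers, a local buffer stands in for the SPM, and `memcpy` stands in for the DMA get and put; it does not use the Sunway OpenACC tile directive itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SPM_BYTES (64 * 1024)   /* scratch pad size stated on the slide */

/* Largest tile (in elements) such that `narrays` buffers of
 * `elem_bytes`-sized elements fit in the SPM once `reserved` bytes
 * are set aside for other data. */
size_t max_tile_elems(size_t narrays, size_t elem_bytes, size_t reserved)
{
    assert(narrays > 0 && elem_bytes > 0 && reserved < SPM_BYTES);
    return (SPM_BYTES - reserved) / (narrays * elem_bytes);
}

/* Tiled update: fetch one tile into the local buffer in a single bulk
 * copy, compute on it locally, then write it back in a single bulk copy. */
void scale_tiled(double *data, size_t n, double factor, size_t tile)
{
    double local[SPM_BYTES / sizeof(double)];   /* stand-in for the SPM */

    assert(tile > 0 && tile <= SPM_BYTES / sizeof(double));

    for (size_t start = 0; start < n; start += tile) {
        size_t len = (n - start < tile) ? (n - start) : tile;
        memcpy(local, data + start, len * sizeof(double));   /* "DMA get" */
        for (size_t i = 0; i < len; ++i) {
            local[i] *= factor;
        }
        memcpy(data + start, local, len * sizeof(double));   /* "DMA put" */
    }
}
```

With 8-byte doubles and a single buffer, `max_tile_elems(1, 8, 0)` gives 8192 elements; with four arrays staged at once it gives 2048. A larger tile means fewer, larger transfers, which is why the tile size that just fills the SPM is the natural candidate for the optimum.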

12 (*) Optimization on Sunway many-core processor
Tool chain: OpenACC code -> swacc (S2S compiler) -> intermediate code (.host and .slave) -> sw5cc (native compiler) -> executable file.
The -keep or -dumpcommand option makes the compiler emit the intermediate code.
256-bit SIMD intrinsics: this part of the push kernel achieves a 5.6x speedup, but the cost of this kernel is too small compared with the entire GTC-P code.

13 Performance on Sunway many-core processor
[Figure: elapsed time [sec] per kernel, lower is better. Kernels: Shift, Smooth, Field, Poisson, Push, Charge. Configurations: Sequential (MPE), OpenACC (CPE), +w/o atomics, +Tile, +SPM library. Annotations: Baseline; 1.1X; avoid atomic operations; increase DMA bandwidth; strengthen data locality in SPM; (*) in-built SIMD code.]

14 Performance and portability of GTC-P on GPU

15 Use native atomic instructions on P100
Native atomic instructions (FP64) are supported on the Pascal architecture.
Compare the PTX code generated by the PGI compiler on K80 and P100.

16 OpenACC version of GTC-P on K80 and P100
The performance of the OpenACC version on P100 is close to that of the CUDA code, thanks to better support for atomic instructions.
OpenACC benefits from the hardware support in the latest GPU architecture.

17 Use a GPU-specific algorithm in the OpenACC code
Remove the auxiliary array that was used to store the 4 points.

18 Performance results of OpenACC version with new algorithm on GPU
Device memory usage per GPU (Tesla K40):

Configuration    CUDA       OpenACC    new OpenACC
B 100 1 2 *      1399 MB    3070 MB    1501 MB
B 100 1 4 *      742 MB     1569 MB    785 MB

The new OpenACC version reduces device memory usage by 50%.

19 [Figure; labels: Core Number; Hardware support for key operations; Gap of memory hierarchy]

20 Summary
Architecture-specific optimizations are necessary to reach reasonable performance in the GTC-P code.
Native atomic support on the GPU currently gives the OpenACC code better performance than the same operations on multi- and many-core processors.
Differences in the memory hierarchy between architectures may call for different algorithms in the OpenACC code.

21 Reference
Stephen Wang, James Lin, Linjin Cai, William Tang, Stephane Ethier, Bei Wang, Simon See and Satoshi Matsuoka. Porting and Optimizing GTC-P on TaihuLight Supercomputer with Sunway OpenACC. HPC China, Best Paper Award (Acceptance Rate < 3%)
Yueming Wei, Stephen Wang, Linjin Cai, William Tang, Bei Wang, Stephane Ethier, Simon See and James Lin. Performance and Portability Studies with OpenACC Accelerated Version of GTC-P. PDCAT,
