Analysis of Performance Gap Between OpenACC and the Native Approach on P100 GPU and SW26010: A Case Study with GTC-P

Stephen Wang 1, James Lin 1, William Tang 2, Stephane Ethier 2, Bei Wang 2, Simon See 1,3
GTC 2018, San Jose, USA, March 27, 2018
1 Shanghai Jiao Tong University, Center for HPC
2 Princeton University, Institute for Computational Science & Engineering (PICSciE) and Plasma Physics Laboratory (PPPL)
3 NVIDIA Corporation

Background
Sunway TaihuLight is now the No. 1 supercomputer on the Top500 list. In the near future, Summit at ORNL will be the next leap in leadership-class supercomputers.
→ A single code base needs to be maintained across different supercomputers.
Real-world applications written with OpenACC can achieve portability across NVIDIA GPUs and Sunway processors. The GTC-P code is a case study.
→ We propose to analyze the performance gap between the OpenACC version and the native programming approach on two different architectures.

GTC-P: Gyrokinetic Toroidal Code - Princeton
Developed by Princeton to accelerate progress in highly scalable plasma turbulence HPC Particle-in-Cell (PIC) codes.
Modern co-design version of the comprehensive original GTC code, with a focus on using computer science performance modeling to improve basic PIC operations and deliver simulations at extreme scales with unprecedented resolution and speed on a variety of architectures worldwide.
Targets include present-day multi-petaflop supercomputers such as Tianhe-2, Titan, Sequoia, and Mira, which feature GPU, multicore CPU, and many-core processors.
KEY REFERENCE: W. Tang, B. Wang, S. Ethier, G. Kwasniewski, T. Hoefler, et al., Extreme Scale Plasma Turbulence Simulations on Top Supercomputers Worldwide, Supercomputing (SC) 2016 Conference, Salt Lake City, Utah, USA.

The case study of GTC-P code with OpenACC
Charge: particle-to-grid interpolation (SCATTER)
Smooth/Poisson/Field: grid work (local stencil)
Push: grid-to-particle interpolation (GATHER); update position and velocity
Shift: in a distributed-memory environment, exchange particles among processors
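The scatter and gather phases listed above can be illustrated with a minimal 1D sketch. This is not GTC-P code: the periodic grid with unit spacing, the linear weights, the unit particle charge, and the function names charge_scatter and push_gather are all assumptions chosen for brevity.

```c
#include <stddef.h>

/* Charge phase (SCATTER): each particle deposits a unit charge onto its two
   nearest grid points with linear weights. Assumes a 1D periodic grid with
   unit spacing and positions satisfying 0 <= x[p] < ngrid. */
void charge_scatter(const double *x, size_t nparticles,
                    double *rho, size_t ngrid)
{
    for (size_t g = 0; g < ngrid; ++g)
        rho[g] = 0.0;

    for (size_t p = 0; p < nparticles; ++p) {
        size_t i = (size_t)x[p];        /* grid point to the left */
        double w = x[p] - (double)i;    /* fractional distance from it */
        rho[i] += 1.0 - w;              /* data-dependent write: the hazard */
        rho[(i + 1) % ngrid] += w;
    }
}

/* Push phase (GATHER): interpolate the grid field to each particle, then
   update velocity and position. Positions are wrapped back into the grid. */
void push_gather(double *x, double *v, size_t nparticles,
                 const double *field, size_t ngrid, double dt)
{
    const double len = (double)ngrid;

    for (size_t p = 0; p < nparticles; ++p) {
        size_t i = (size_t)x[p];
        double w = x[p] - (double)i;
        double e = (1.0 - w) * field[i] + w * field[(i + 1) % ngrid];

        v[p] += e * dt;
        x[p] += v[p] * dt;

        /* Periodic wrap; the order of the two loops keeps x[p] < len. */
        while (x[p] < 0.0)  x[p] += len;
        while (x[p] >= len) x[p] -= len;
    }
}
```

The scatter loop writes to grid locations chosen by particle data, so two particles can update the same grid point; the gather loop only reads the grid, which is why it parallelizes without conflicts.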

The case study of GTC-P code with OpenACC
Challenges
a. Memory-bound kernels
b. Data hazards
c. Random memory access
Methodology
a. Reduce the memory bandwidth demand
b. Use atomic operations, or duplication and reduction
c. Take full advantage of local memory

The performance of atomic operations on P100 and SW26010
NVIDIA GPU (P100): [Figure: elapsed time (s), CUDA vs. OpenACC]
CUDA supports global atomics in a coalesced way by transposing in shared memory.
Sunway processor (SW26010): [Figure: elapsed time (s), serial code on 1 MPE vs. OpenACC code on 64 CPEs]
The OpenACC code on 64 CPEs is slower than the serial code on 1 MPE, which is unacceptable.
Atomic operations on SW26010 are implemented with a lock-and-unlock methodology.
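For context, an atomic scatter in OpenACC typically looks like the sketch below. It uses the standard parallel loop and atomic update directives; the function name scatter_atomic, the precomputed cell index array, and the data clauses are illustrative assumptions, not the GTC-P source. Compiled without OpenACC support, the pragmas are ignored and the loop runs serially.

```c
#include <stddef.h>

/* Deposit q[p] into rho[cell[p]] for every particle. Several particles may
   share a cell, so each update is protected by an atomic directive.
   Assumes 0 <= cell[p] < ngrid for all p. */
void scatter_atomic(const int *cell, const double *q, int nparticles,
                    double *rho, int ngrid)
{
    #pragma acc parallel loop \
        copyin(cell[0:nparticles], q[0:nparticles]) copy(rho[0:ngrid])
    for (int p = 0; p < nparticles; ++p) {
        #pragma acc atomic update
        rho[cell[p]] += q[p];
    }
}
```

How the atomic update is lowered depends on the target: a hardware atomic instruction is cheap, while a lock-and-unlock implementation serializes the updates that contend for the same lock.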

Performance evaluation on NVIDIA P100
The native atomicAdd instruction is used on P100, instead of the compare-and-swap loop implemented with the atomicCAS instruction on K80.
The performance gap of GTC-P between CUDA and OpenACC is narrowed by the hardware upgrade.
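The compare-and-swap loop mentioned above can be sketched on the host with C11 atomics. This is an analogue of the GPU technique rather than device code: on the GPU the loop would be built on atomicCAS, and the sketch assumes the accumulated values are double precision. The helper names store_double, load_double, and atomic_add_double_cas are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Store a double into a 64-bit atomic word as raw bits. */
void store_double(_Atomic uint64_t *addr, double val)
{
    uint64_t bits;
    memcpy(&bits, &val, sizeof bits);
    atomic_store(addr, bits);
}

/* Read the double held in a 64-bit atomic word. */
double load_double(_Atomic uint64_t *addr)
{
    uint64_t bits = atomic_load(addr);
    double val;
    memcpy(&val, &bits, sizeof val);
    return val;
}

/* Add val to the double at addr with a compare-and-swap retry loop.
   Returns the value that was stored before the addition. */
double atomic_add_double_cas(_Atomic uint64_t *addr, double val)
{
    uint64_t old_bits = atomic_load(addr);

    for (;;) {
        double old_val;
        memcpy(&old_val, &old_bits, sizeof old_val);

        double new_val = old_val + val;
        uint64_t new_bits;
        memcpy(&new_bits, &new_val, sizeof new_bits);

        /* Succeeds only if no other thread changed the word meanwhile;
           on failure old_bits is refreshed and the addition is redone. */
        if (atomic_compare_exchange_weak(addr, &old_bits, new_bits))
            return old_val;
    }
}
```

Under contention every failed exchange repeats the load, add, and exchange, which is why a single native add instruction is expected to be cheaper than the loop.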

Implementation of the OpenACC version on SW26010
A duplication and reduction algorithm is used instead of atomic operations; it is implemented with the help of the global variable acc_thread_id.
The tile directive is used to coalesce data accesses into DMA requests that fill the 64 KB LDM.
[Figure: DMA transfers between main memory and the LDM]
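The duplication and reduction idea can be sketched in plain C as follows. The loop over tid stands in for the parallel threads (on SW26010 the thread index would come from acc_thread_id); the function name scatter_duplicated, the round-robin particle assignment, and the heap-allocated private copies are assumptions made for illustration.

```c
#include <stdlib.h>

/* Scatter without atomics: every thread accumulates into its own private
   copy of the grid, and the copies are summed afterwards.
   Assumes 0 <= cell[p] < ngrid. Returns 0 on success, -1 if the private
   copies cannot be allocated. */
int scatter_duplicated(const int *cell, const double *q, int nparticles,
                       double *rho, int ngrid, int nthreads)
{
    double *priv = calloc((size_t)nthreads * (size_t)ngrid, sizeof *priv);
    if (priv == NULL)
        return -1;

    /* Phase 1 (duplication): thread tid only touches its own copy, so no
       two threads ever write the same location. Serial stand-in here. */
    for (int tid = 0; tid < nthreads; ++tid) {
        double *mine = priv + (size_t)tid * (size_t)ngrid;
        for (int p = tid; p < nparticles; p += nthreads)
            mine[cell[p]] += q[p];
    }

    /* Phase 2 (reduction): sum the private copies into the shared grid. */
    for (int g = 0; g < ngrid; ++g) {
        double sum = 0.0;
        for (int tid = 0; tid < nthreads; ++tid)
            sum += priv[(size_t)tid * (size_t)ngrid + (size_t)g];
        rho[g] += sum;
    }

    free(priv);
    return 0;
}
```

The trade-off is memory: the private copies cost nthreads times the grid size, which matters when each thread has only a small local store.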

Performance evaluation of the OpenACC version on SW26010
[Figure: elapsed time in seconds (lower is better), broken down by kernel (Shift, Smooth, Field, Poisson, Push, Charge), for Sequential (MPE), OpenACC (CPE), + w/o atomics, + Tile, + SPM library; bar labels include Baseline and 1.1X]
The performance is acceptable after removing the atomic operations on SW26010.
Taking full advantage of DMA bandwidth is the key factor for the memory-bound kernels.
The charge kernel is the hotspot of the OpenACC version.
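The "+ Tile" step relies on moving data between main memory and the local memory in large contiguous chunks. A minimal host-side sketch of that staging pattern is shown below. memcpy stands in for the DMA get and put, and the tile size (4096 doubles, half of a 64 KB local store), the function name scale_tiled, and the scaling kernel itself are assumptions, not the Sunway API or GTC-P code.

```c
#include <stddef.h>
#include <string.h>

/* 4096 doubles = 32 KB, leaving half of a 64 KB local store for other data
   (an illustrative choice). */
#define TILE_ELEMS 4096

/* Multiply every element of data by factor, working on one tile at a time
   in a small local buffer. Returns the number of tiles processed. */
size_t scale_tiled(double *data, size_t n, double factor)
{
    double local[TILE_ELEMS];   /* stand-in for the local data memory */
    size_t tiles = 0;

    for (size_t start = 0; start < n; start += TILE_ELEMS) {
        size_t len = n - start;
        if (len > TILE_ELEMS)
            len = TILE_ELEMS;

        /* "DMA get": one bulk contiguous copy into local memory. */
        memcpy(local, data + start, len * sizeof *local);

        /* Compute entirely out of the local buffer. */
        for (size_t i = 0; i < len; ++i)
            local[i] *= factor;

        /* "DMA put": one bulk contiguous copy back to main memory. */
        memcpy(data + start, local, len * sizeof *local);
        ++tiles;
    }
    return tiles;
}
```

The point of the pattern is that main memory is touched only through a few large transfers per tile, rather than through many small, scattered accesses.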

Register-level communication on SW26010
A low-latency register communication mechanism connects the CPEs within a CPE cluster; it is the key factor for data locality.

The RLC optimization for the charge kernel on SW26010
The charge kernel has an irregular memory access pattern.
The index values are preconditioned on the MPE and then transferred to the first column of the CPE cluster.
The irregular accesses are carried out on the remaining CPEs through row communication.

The async optimization for the charge kernel on SW26010
The irregular memory accesses handled by RLC on the CPE cluster run simultaneously with the remaining part, which is left out of RLC because of the limited SPM space.
The performance is tuned manually.

Performance tuning of the charge kernel on SW26010
Finally, we achieved around a 4X speedup with the native approach compared with the OpenACC version on SW26010 processors.

How about the scaling of the OpenACC version of the GTC-P code on real supercomputers? (Early Results)

Experimental results of the scaling evaluation on the GPU cluster at SJTU
[Figure: weak scaling]

Experimental results of the scaling evaluation on the Titan supercomputer
One K20X per node
Gemini interconnect
Strong scaling is still to be done.

Experimental results of the scaling evaluation on the Sunway TaihuLight supercomputer

Summary
The case study demonstrated the portability of OpenACC across GPUs and the Chinese home-grown many-core processor, although the algorithm had to be refactored on SW26010 compared with the GPU version.
The performance gap between the OpenACC version and the CUDA version of GTC-P on NVIDIA P100 is narrowed by the hardware upgrade.
The experiments showed that the performance gap on SW26010 cannot be ignored, owing to the lack of a high-efficiency general software cache on the CPE cluster. We designed a specific register-level communication scheme to fix the problem.

Reference
Yueming Wei, Yichao Wang, Linjin Cai, William Tang, Bei Wang, Stephane Ethier, Simon See and James Lin. Performance and Portability Studies with OpenACC Accelerated Version of GTC-P. The 17th International Conference on Parallel and Distributed Computing, Applications and Technologies, Guangzhou, China, December 16-18.
Yichao Wang, James Lin, Linjin Cai, William Tang, Stephane Ethier, Bei Wang, Simon See and Satoshi Matsuoka. Porting and Optimizing GTC-P on TaihuLight Supercomputer with Sunway OpenACC. Journal of Computer Research and Development, 2018, 55(4).
