Performance and Portability Studies with OpenACC Accelerated Version of GTC-P


Yueming Wei, Yichao Wang, Linjin Cai, William Tang, Bei Wang, Stephane Ethier, Simon See, James Lin
Center for High Performance Computing, Shanghai Jiao Tong University, China
Princeton Institute for Computational Science and Engineering, Princeton University, Princeton, NJ, USA
Princeton Plasma Physics Laboratory, Princeton, NJ, USA
NVIDIA, Singapore
Tokyo Institute of Technology, Japan
Corresponding author: James Lin, james@sjtu.edu.cn

Abstract: Accelerator-based heterogeneous computing is of paramount importance to High Performance Computing. The increasing complexity of cluster architectures requires more generic, high-level programming models. OpenACC is a directive-based parallel programming model that provides performance on, and portability across, a wide variety of platforms, including GPUs, multicore CPUs, and many-core processors. GTC-P is a discovery-science-capable real-world application code based on the Particle-In-Cell (PIC) algorithm that is well established in the HPC area. Basic versions of this code have demonstrated performance portability on TOP500 supercomputers with different architectures, including Titan, Mira, etc. Motivated by the increasing interest in state-of-the-art portability on emerging architectures, we have implemented the first OpenACC version of GTC-P and evaluated its performance portability across NVIDIA GPUs, Intel x86 CPUs, and OpenPOWER CPUs. In this paper, we also propose two key optimization methods for the OpenACC implementation of the PIC algorithm on multicore CPU and GPU, focusing on removing atomic operations and taking advantage of shared memory. The OpenACC version is shown to deliver impressive productivity and performance with respect to portability and scalability, achieving more than 90% of the performance of the native versions with only about 300 LOC of changes.

Keywords: Gyrokinetic PIC code, GTC-P, OpenACC, CUDA, GPU, OpenPOWER

I. INTRODUCTION

Recent years have witnessed the widespread adoption of accelerators for High Performance Computing (HPC). By the end of 2015, more than 100 accelerated systems were on the list of the world's 500 most powerful supercomputers, accounting for 143 petaflops, over one-third of the list's total FLOPS [1]. While accelerators provide tremendous computational power, they differ significantly in architecture, which demands more generic, high-level programming models. OpenACC is one such model designed to tackle this problem. It provides a directive-based approach in which a single source code base offers functional portability across different platforms, including GPUs, multicore CPUs, and many-core processors. OpenACC today is considered among the most promising approaches to developing high-performance scientific applications [2]. The directives and programming model defined in OpenACC allow programmers to create high-level host+accelerator programs without the need to explicitly initialize the accelerator or to manage data and program transfers between the host and the accelerator. OpenACC directives expose parallelism in C, C++, or Fortran programs, which compilers use to provide optimizations for different architectures, such as x86 multicore, GPU, and OpenPOWER. The portability provided by OpenACC makes it possible to run a single source code across platforms with different architectures.
The Gyrokinetic Toroidal Code developed at Princeton (GTC-P) is an advanced plasma turbulence HPC code that can be applied to high-resolution, size-scaling studies relevant to the next-generation International Thermonuclear Experimental Reactor (ITER) [9]. GTC-P is a modern co-design version of the comprehensive original GTC code [8], focused on performance enhancement of the basic particle-in-cell (PIC) operations; it delivers the capability to efficiently carry out computations at extreme scales with unprecedented resolution and speed on a variety of different architectures on present-day multi-petaflop supercomputers, including Tianhe-2, Titan, Sequoia, Mira, etc., which feature GPUs, multicore CPUs, and many-core processors [5]. In this study, motivated by improving state-of-the-art portability on advanced architectures, we focus on the development and implementation of an OpenACC version of GTC-P, together with the associated evaluation of performance portability across multiple modern platforms and of scalability across multiple nodes.

Original contributions:
- We implement the first OpenACC-accelerated version of GTC-P.
- We present two optimization methods for the OpenACC version of GTC-P on x86 multicore and GPU, removing the atomic operation and taking advantage of shared memory, to achieve more than 90% of native performance with 34 and 214 LOC of changes, respectively.
- We study the performance portability of the OpenACC version of GTC-P on x86 multicore, x86+GPU, and OpenPOWER+GPU platforms, and the scalability of this OpenACC code on a real cluster.

The rest of this paper is organized as follows. Section II describes related work. Our implementation and optimization of GTC-P using OpenACC are presented in Sections III and IV. Section V presents the performance evaluation in detail, and Section VI concludes the paper.

II. RELATED WORK

OpenACC evaluation. Several works have focused on the performance evaluation of OpenACC. In [4], Tetsuya Hoshino et al. compare the performance of OpenACC and CUDA based on two microbenchmarks and one real-world CFD application. Their study shows that the lack of an interface for shared memory in OpenACC leads to a performance gap between OpenACC and CUDA. Yichao Wang et al. study the performance portability of OpenACC on NVIDIA Kepler and Intel Knights Corner with one benchmark suite [10]. However, their study covers only single-node evaluation, and portability is not considered. Moreover, x86 multicore and OpenPOWER, which are supported by recent releases of the PGI compiler, are not evaluated in their study.

OpenACC porting work. Several real-world applications have been ported to GPU using OpenACC. Kraus et al. [6] and Stone et al. [7] use OpenACC to accelerate CFD algorithms on multiple GPUs. Their studies showed that OpenACC allows significant acceleration of applications with a large code base with reasonable effort. However, they only study the performance of OpenACC on GPUs; portability across multiple platforms is not considered. F. Hariri et al. developed a portable platform for accelerating PIC codes on heterogeneous many-core architectures [3]. Their study showed that, by porting the PIC algorithm to GPU using only OpenACC, a performance gain of a factor of 3 could be obtained on a Kepler K20X GPU over a Sandy Bridge 8-core CPU. When CUDA is used for fine-tuned optimization, the performance gain rises to 3.4. In our study, we also use CUDA to implement the charge routine for further optimization. However, that work focuses only on parallelization strategies on a single node using OpenACC; portability across multiple platforms and scalability across multiple nodes are not evaluated.

III. OPENACC IMPLEMENTATION OF GTC-P

This section describes in detail the strategies we adopt to implement our OpenACC version of GTC-P based on the GTC-P code running on CPU [9]. The code base is the GTC-P CPU code with OpenMP directives that take full advantage of multicore. We profiled the CPU version using test case A on a Sandy Bridge CPU (E5-2670) and located the hotspots, the charge and push routines, which consume 86.1% of the overall execution time, as shown in Fig. 1. Therefore, our target is to port charge and push onto the accelerator and leave the remaining four subroutines on the host.

Fig. 1: GTC-P profiling result on an 8-core Sandy Bridge CPU (E5-2670) using test case A; the charge and push hotspots account for more than 85% of the execution time.

A. Parallelization

The porting to OpenACC is performed by converting one routine at a time, starting from charge and then moving to push. The parallel regions are defined by the parallel directives, combined with data clauses to specify the movement of data into and out of the parallel region. We add the parallel directives at the top of every loop nest, accompanied by the copy, copyin, and copyout data clauses. We use the independent clause to indicate that there are no data dependencies between loop iterations.
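As an illustration of this first step, the following sketch shows a push-like particle loop annotated with a parallel loop directive and explicit data clauses; the variable names and the update formula are illustrative placeholders, not the actual GTC-P source.

/* Sketch of the naive porting step: a push-like particle loop annotated with
   an OpenACC parallel loop and explicit data clauses (illustrative names). */
void push_sketch(int mi, double dt,
                 const double *restrict efield, double *restrict zion)
{
    #pragma acc parallel loop independent \
                copyin(efield[0:mi]) copy(zion[0:mi])
    for (int m = 0; m < mi; m++) {
        /* advance each particle using the field gathered at its location */
        zion[m] += dt * efield[m];
    }
}

With this form, the compiler generates a GPU kernel for the loop and inserts the host-device copies implied by the data clauses around it.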
B. Data management

The naive OpenACC implementation, which simply offloads the computation-heavy loop nests to the GPU, however, decreases performance relative to the CPU version. The performance gap is caused by the increased elapsed time of data transfers from host to device (HTD) and device to host (DTH). In our test case, we simulate 100 time steps; therefore, data transfers happen 100 times for each kernel, which dominates the performance on the GPU. Obviously, we do not want to waste time moving data between the host and the GPU in every iteration.

Fig. 2: Profiling of GTC-P using nvvp to analyze the data movement between the host and the device. Without the optimization, HTD and DTH memory copies occur in every step; with the optimization, the HTD memory copy occurs only at the beginning of the program.

To reduce the data movement, we create a containing data region using the data directive and combine the parallel directives with the present data clause, indicating that the data is already present in device memory. In addition, we also port the shift routine to the GPU so that all the data operated on by the GPU stays on the device. This allows the data to be initialized on the host, transferred to device memory at the start of the simulation, and retained in device memory for the lifetime of the application. Fig. 2 shows the profiling result before and after reducing the data movement between the host and the device.
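A minimal sketch of this data-region structure, with an illustrative time loop and placeholder array names rather than the actual GTC-P variables, is shown below.

/* Sketch: one enclosing data region moves the arrays to the device once,
   before the time loop, and keeps them resident until the loop completes. */
void timeloop_sketch(int nsteps, int mi, double dt,
                     const double *restrict efield, double *restrict zion)
{
    #pragma acc data copyin(efield[0:mi]) copy(zion[0:mi])
    {
        for (int istep = 0; istep < nsteps; istep++) {
            /* present() asserts the data is already on the device, so the
               compiler generates no per-step host<->device transfers */
            #pragma acc parallel loop present(efield[0:mi], zion[0:mi])
            for (int m = 0; m < mi; m++)
                zion[m] += dt * efield[m];
        }
    }
}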

C. Atomic operation

Accelerating the major part of each kernel can be accomplished by applying the parallel loop directive over the main loop. In the charge routine, however, there is one issue we need to handle carefully: the data race caused by the main loop over all the particles. Different particles may update the same grid point simultaneously, creating a data race and incorrect results. We solve this issue by using atomic memory operations, as shown in Listing 1.

Listing 1: Atomic operations in the charge routine
#pragma acc parallel loop
for (m = 0; m < mi; m++) {
    /* grid indices of the cells this particle deposits charge into */
    ij1 = kk + (mzeta+1)*(jtiontmp  - igrid_in);
    ij2 = kk + (mzeta+1)*(jtion1tmp - igrid_in);
    #pragma acc atomic update
    densityi[ij1]   += d1;
    #pragma acc atomic update
    densityi[ij1+1] += d2;
}

IV. OPENACC OPTIMIZATION OF GTC-P

In this section, we present the methods we adopt to optimize the OpenACC version of GTC-P on the Kepler GPU and on the x86 multicore CPU.

A. Kepler GPU

1) Thread mapping: On NVIDIA Kepler, up to 2048 threads can run on a single streaming multiprocessor (SMX). All the threads active on the same SMX compete for limited resources such as the L1 cache, the texture cache, and registers. Therefore, reducing the occupancy appropriately can improve performance. PGI's OpenACC implementation allows the occupancy to be controlled manually by setting the block size through the vector_length clause. Besides the occupancy, we also need to consider the setup cost of each thread. By default, the compiler launches enough gangs, with the vector width indicated by vector_length, to cover the iteration space. More threads mean fewer iterations executed by each thread, and when the number of iterations per thread is small enough, the setup cost of each thread cancels out the performance gained from the computation. We therefore use the num_gangs and vector_length clauses to control the number of iterations executed by each thread. When we apply the parallel loop directive over a two-level nested loop, OpenACC automatically applies gang parallelism to the outer loop and vector parallelism to the inner loop. In the push routine, the inner loop has just four iterations, and we need to add a loop seq directive over the inner loop to prevent this automatic parallelization by the compiler and thereby improve performance.
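The sketch below illustrates how these clauses might be combined on a push-like loop; the particular values of num_gangs and vector_length, as well as the array names, are illustrative only and not the tuned settings used in this work.

/* Sketch of the thread-mapping controls: num_gangs and vector_length bound
   the number of launched threads, and "loop seq" keeps the four-iteration
   inner loop sequential within each thread. */
void push_inner_sketch(int mi, const double *restrict zion, double *restrict dtem)
{
    #pragma acc parallel loop num_gangs(1024) vector_length(128) \
                present(zion[0:4*mi], dtem[0:mi])
    for (int m = 0; m < mi; m++) {
        double sum = 0.0;
        /* only four iterations: keep them sequential inside each thread */
        #pragma acc loop seq
        for (int j = 0; j < 4; j++)
            sum += zion[4*m + j];
        dtem[m] = sum;
    }
}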
2) Optimization with CUDA: OpenACC is a device-independent standard and does not include support for device-dependent hardware components such as the texture cache and shared memory in its GPU memory model, which means the shared memory cannot be explicitly allocated and used to share data between threads in a thread block. This gives the OpenACC version a significant performance limitation compared to the CUDA version. Moreover, the thread ID is unavailable, unlike in CUDA and OpenMP, so we cannot use cooperative computation to capture locality for co-scheduled threads as CUDA does. OpenACC is, however, interoperable with CUDA, which means we can use highly specialized features available in CUDA for further optimization, of course at the expense of portability and maintainability.

We use CUDA code to optimize the charge routine so that it uses shared memory explicitly, just as the CUDA version does. With an additional 14 LOC, the hybrid OpenACC code obtains a 1.4x speedup compared with the optimized pure OpenACC code on a single node. With this step-by-step optimization, the OpenACC code achieves a 4.21x speedup compared with the original OpenMP code on a single node, as shown in Fig. 3.

Fig. 3: Wall-clock time (secs) for test case A on one node (E5-2695 v3 with two NVIDIA K80s), showing the OpenMP baseline, the OpenACC baseline (1.24x), the version with the thread-mapping optimization (3.5x), and the version with the CUDA optimization (4.21x).
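A sketch of how this OpenACC/CUDA interoperability can be expressed is shown below. Here launch_charge_shared() is a hypothetical wrapper, assumed to be compiled separately with nvcc, whose CUDA kernel accumulates per-block partial charge arrays in shared memory before flushing them to global memory; the array names are illustrative, not the GTC-P data structures.

/* Sketch of OpenACC/CUDA interoperability for the charge routine.
   launch_charge_shared() is a hypothetical CUDA wrapper, not part of GTC-P. */
extern void launch_charge_shared(const double *weight_d, const int *cell_d,
                                 double *densityi_d, int mi, int mgrid);

void charge_hybrid_sketch(const double *weight, const int *cell,
                          double *densityi, int mi, int mgrid)
{
    /* within host_data, the listed names refer to the device copies created
       by the enclosing OpenACC data region, so the CUDA kernel can operate
       on them directly without extra transfers */
    #pragma acc host_data use_device(weight, cell, densityi)
    {
        launch_charge_shared(weight, cell, densityi, mi, mgrid);
    }
}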

B. x86 multicore CPU

When we compiled and ran the OpenACC code (V1) directly on the x86 CPU, there was a significant drop in the performance of charge (V1), which is about 200 times slower than the OpenMP code, as shown in Table I.

TABLE I: Performance of the OpenACC and OpenMP charge routine on a 16-core Sandy Bridge x86 multicore CPU with test case A.
charge V1 (OpenACC, with atomic): about 200x slower than OpenMP
charge V2 (OpenACC, without atomic): 8.91 s
charge (OpenMP): 7.87 s

Fig. 4: An illustration of the thread mapping policy of OpenACC on multicore. Assume there is a loop with eight iterations running on a CPU with four physical cores: the first two iterations are assigned to the 1st physical core, the 3rd and 4th iterations are assigned to the 2nd physical core, and so on.

From the profiling result, we found that the atomic operation is the key factor causing the performance drop of the OpenACC code on x86. In the OpenMP version of the GTC-P code, the atomic operation in charge is avoided by allocating replicated copies of the densityi array for each thread and performing a reduction at the end, which requires the thread ID. The OpenMP thread ID can be obtained through the API function omp_get_thread_num(); however, there is no such API function in OpenACC, so we cannot obtain the thread ID explicitly. But once we understand the thread mapping policy of OpenACC on multicore, we can derive the thread ID manually and remove the atomic operation. An illustration of the thread mapping policy of OpenACC on multicore is shown in Fig. 4, and Listing 2 is a code example showing how to obtain the thread ID manually. After a 34-LOC modification of the original OpenACC code, the execution time of charge (V2) was reduced to 8.91 s, an almost 177x speedup.

Listing 2: Obtaining the OpenACC thread ID manually, based on the OpenACC thread mapping policy on multicore.
if (mi % ACC_NUM_CORES == 0)
    mi_per_core = mi / ACC_NUM_CORES;
else
    mi_per_core = mi / ACC_NUM_CORES + 1;
#pragma acc parallel loop
for (m = 0; m < mi; ++m) {
    /* iterations are block-distributed, so the owning core follows directly */
    thread_id = m / mi_per_core;
}
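Building on Listing 2, the following sketch shows how the derived thread ID can be used to give each thread a private copy of densityi and to perform the reduction afterwards. The array names, the fixed core count, and the assumption that the runtime block-distributes iterations exactly as in Fig. 4 are illustrative assumptions, not the actual GTC-P code; densityi_copies is assumed to be zero-initialized by the caller.

/* Sketch of the atomic-free charge deposit on multicore (the V2 idea): each
   thread deposits into its own copy of densityi, selected via the manually
   derived thread ID, and the copies are reduced afterwards. */
#define ACC_NUM_CORES 16   /* must match the number of cores actually used */

void charge_noatomic_sketch(int mi, int mgrid, const int *restrict cell,
                            const double *restrict weight,
                            double *restrict densityi,
                            double *restrict densityi_copies /* ACC_NUM_CORES*mgrid */)
{
    int mi_per_core = (mi % ACC_NUM_CORES == 0) ? mi / ACC_NUM_CORES
                                                : mi / ACC_NUM_CORES + 1;

    #pragma acc parallel loop
    for (int m = 0; m < mi; m++) {
        int tid = m / mi_per_core;   /* thread ID from the block mapping */
        /* each thread writes only to its own copy, so no atomic is needed */
        densityi_copies[tid * mgrid + cell[m]] += weight[m];
    }

    /* reduce the per-thread copies back into the shared densityi array */
    for (int t = 0; t < ACC_NUM_CORES; t++)
        for (int i = 0; i < mgrid; i++)
            densityi[i] += densityi_copies[t * mgrid + i];
}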
V. EVALUATION RESULTS AND ANALYSIS

In this section, we evaluate the portability and scalability of the OpenACC implementation of GTC-P. For portability, we run the OpenACC code on three different architectures and compare its single-node performance across the different platforms. We evaluate the scalability of OpenACC by comparing it with the CUDA and OpenMP versions across multiple nodes.

A. Portability

In this section, we analyze the portability of the OpenACC version implemented in the previous sections across different platforms. We consider three target computing systems supported by the PGI OpenACC compiler: x86 multicore, x86 + NVIDIA GPU, and OpenPOWER + NVIDIA GPU. Table II shows the system configuration of the different platforms. The NVIDIA K80 is a dual-GPU card; since we only study the portability of the OpenACC code on a single node without MPI, only one of its two GK210 GPUs is used. The target architecture for compilation is specified by the appropriate options (e.g., -ta=tesla and -ta=multicore for the NVIDIA GPU and the x86 multicore CPU, respectively).

TABLE II: System configuration for the portability evaluation
Platform: CPU+GPU; Host: Intel Xeon E5-2695 v3; Device: Tesla K80; Compiler: PGI 16.5
Platform: CPU+GPU; Host: POWER8; Device: Tesla K80; Compiler: PGI 16.7 (Beta)
Platform: CPU; Host: Intel Xeon E5 (Sandy Bridge); Device: none; Compiler: PGI 16.5

In the x86 multicore case, the parallelization performed by the compiler is similar to that implied by OpenMP directives. The compiler uses all available cores on the processor unless a different number is specified by the gang clause or through the ACC_NUM_CORES environment variable. A multicore CPU is treated as a shared-memory accelerator, so data clauses are ignored and no data copies are executed.

From the functional point of view, the portability of the code is excellent, as the same C code, annotated with the same OpenACC pragmas, immediately runs on all architectures. However, the result on x86 was not entirely satisfying, and we optimized the code by removing the atomic operation with an additional 34 LOC, as described previously. Fig. 5 shows the performance of the OpenACC code across the different platforms. From a performance portability perspective, this is an excellent result.

Fig. 5: OpenACC performance on three different platforms (x86 multicore, x86+GPU, and POWER8+GPU): elapsed time in seconds (lower is better), broken down into the Sort, Smooth, Field, Poisson, Push, and Charge routines.

B. Scalability

We evaluated the scalability of our OpenACC implementation of GTC-P against the existing CUDA and OpenMP versions using the π cluster at the High Performance Computing Center (HPCC), Shanghai Jiao Tong University.

The hardware and software environment employed for each node is shown in Table III.

TABLE III: Machine environment for the scalability evaluation (π cluster)
CPU: Intel Xeon E5-2670 (2.6 GHz)
GPU: NVIDIA Tesla Kepler K20M × 2
Interconnect: InfiniBand (Mellanox FDR, 56 Gb/s)
OS: Red Hat Enterprise Linux Server release 6.3
MPI: MPICH 3.2
Compiler: PGI 16.4

1) Weak scaling: In this study, we perform weak scaling experiments. The simulation size of GTC-P is determined by several numerical parameters. Table IV shows the default parameters for problem sizes A, B, and C, which we used to evaluate the weak scaling of GTC-P.

TABLE IV: Parameters for the weak scaling evaluation from the A to the C plasma size, using 1 to 16 MPI processes: mpsi, mthetamax, mgrid, total number of particles, and number of MPI processes for each problem size.

For the weak scaling of the OpenACC and CUDA versions, we use up to 8 GPU nodes of π, with two Tesla K20M GPUs on each node, so the total number of GPUs ranges from 1 to 16. For the weak scaling of the OpenMP version, we use up to 8 CPU nodes of π, each with two 8-core Sandy Bridge CPUs. We use 1, 4, and 16 MPI processes to evaluate the A, B, and C problem sizes, respectively, with one MPI process per GPU or CPU; these parameter and MPI-process settings keep the number of grid points and particles per MPI process constant across the test cases.

Figure 6 shows the weak scaling results for the A, B, and C problem sizes with a particle density of 100 particles per cell using the OpenACC, CUDA, and OpenMP versions. We start from the A-size problem with npartdom=1, then the B-size problem with npartdom=4, and the C-size problem with npartdom=16. Table IV shows the other important parameter settings. From the figure, we can see that OpenACC achieves scalability comparable to that of CUDA.

Fig. 6: Weak scaling of the OpenACC, CUDA, and OpenMP versions of GTC-P (vertical scale = wall-clock time for 100 time steps).

2) Strong scaling: We then use up to 16 nodes to perform the strong scaling evaluation. For strong scaling, we keep the problem size constant and vary the number of MPI processes from 2 to 32. We perform the strong scaling experiments with radial domain decomposition for test case B at a resolution of 100 particles per cell. Fig. 7 shows the strong scaling results of the OpenMP, OpenACC, and CUDA versions of GTC-P on up to 16 nodes. From the figure, we can see that, as expected, CUDA represents the standard for scalability, but the OpenACC version nevertheless offers reasonably comparable performance together with ease of use. Fig. 8 shows the strong scaling of the push and charge routines of the OpenACC version. We can see that the push routine scales very well, with an almost linear curve; the charge routine is the main factor limiting the scalability of the OpenACC version.

Fig. 7: Strong scaling of the OpenMP, OpenACC, and CUDA versions of GTC-P.

Fig. 8: Strong scaling of the OpenACC version of GTC-P and its accelerated kernels.

VI. CONCLUSION AND FUTURE WORK

Motivated by improving state-of-the-art portability, we have developed and implemented an OpenACC version of GTC-P, a discovery-science-capable modern PIC code deployed in investigations of large nuclear fusion systems at increasing problem sizes and unprecedented resolution.
We have run the OpenACC code across different architectures and on multiple nodes to evaluate its portability and scalability. Comparisons of the performance results achieved with OpenACC against previous versions of this code were carried out to examine whether OpenACC is a good option for accelerating this example of a real-world PIC application.

It is found that the OpenACC version exhibits a high level of productivity and portability. In particular, the new OpenACC version satisfies our goal of achieving acceptable performance on NVIDIA GPUs with a level of effort considerably less demanding than doing so with CUDA. This is illustrated by the fact that, with changes to only 213 out of roughly 8,000 LOC of such an application code, the new OpenACC version achieves 73% of the performance of the CUDA version of the code on GPUs, with acceptable scalability up to 16 nodes. In future investigations, we plan to further extend the evaluation of the OpenACC version of GTC-P to other platforms such as ARM, with the goal of gaining insight into how to achieve reasonable performance portability for a production code (as exemplified by GTC-P) rather than for simple kernels alone. This approach holds promise for providing more meaningful benchmarks in realistic assessments of advanced HPC code performance on leadership-class supercomputers.

VII. ACKNOWLEDGEMENT

This research is supported in part by an NSF SAVI project in the US supporting pilot international collaborative research and development, and by the China National High-Tech R&D Plan (863 Plan) projects 214AA1A32 and 216YFB218 at Shanghai Jiao Tong University (SJTU), an NVIDIA Center of Excellence. James Lin gratefully acknowledges support from Japan's JSPS RONPAKU (PhD Thesis) Program at Tokyo Institute of Technology. Thanks are also extended to Dr. Zhen Wang and Michael Wolfe of PGI for their much appreciated advice, and to the OpenPOWER Foundation for kindly providing us access to an OpenPOWER test machine.

REFERENCES

[1] Accelerator use surges in world's top supercomputers.
[2] C. Bonati, E. Calore, S. Coscetti, M. D'Elia, M. Mesiti, F. Negro, S. F. Schifano, and R. Tripiccione. Development of scientific software for HPC architectures using OpenACC: the case of LQCD. In Software Engineering for High Performance Computing in Science (SE4HPCS), 2015 IEEE/ACM 1st International Workshop on. IEEE, 2015.
[3] F. Hariri, T. M. Tran, A. Jocksch, E. Lanti, J. Progsch, P. Messmer, S. Brunner, C. Gheller, and L. Villard. A portable platform for accelerated PIC codes and its application to GPUs using OpenACC. Computer Physics Communications, 2016.
[4] T. Hoshino, N. Maruyama, S. Matsuoka, and R. Takaki. CUDA vs. OpenACC: performance case studies with kernel benchmarks and a memory-bound CFD application. In Cluster, Cloud and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium on. IEEE, 2013.
[5] K. Z. Ibrahim, K. Madduri, S. Williams, B. Wang, S. Ethier, and L. Oliker. Analysis and optimization of gyrokinetic toroidal simulations on homogenous and heterogenous platforms. International Journal of High Performance Computing Applications, 27(4), 2013.
[6] J. Kraus, M. Schlottke, A. Adinetz, and D. Pleiter. Accelerating a C++ CFD code with OpenACC. In Proceedings of the First Workshop on Accelerator Programming using Directives. IEEE Press, 2014.
[7] C. P. Stone and B. H. Elton. Accelerating the multi-zone scalar pentadiagonal CFD algorithm with OpenACC. In Proceedings of the Second Workshop on Accelerator Programming using Directives, page 2. ACM, 2015.
[8] K. Tsugane, T. Boku, H. Murai, M. Sato, W. Tang, and B. Wang. Hybrid-view programming of nuclear fusion simulation code in the PGAS parallel programming language XcalableMP. Parallel Computing, 2016.
[9] B. Wang, S. Ethier, W. Tang, T. Williams, K. Z. Ibrahim, K. Madduri, S. Williams, and L. Oliker. Kinetic turbulence simulations at extreme scale on leadership-class systems. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '13), page 82. ACM, 2013.
[10] Y. Wang, Q. Qin, S. See, and J. Lin. Performance portability evaluation for OpenACC on Intel Knights Corner and NVIDIA Kepler. HPC China, 2013.
