Performance Analysis and Optimization of Gyrokinetic Toroidal Code on TH-1A Supercomputer
Xiaoqian Zhu 1,2, Xin Liu 1, Xiangfei Meng 2, Jinghua Feng 2
1 School of Computer, National University of Defense Technology, Changsha, China
2 National Supercomputer Center in Tianjin, Tianjin, China
zhu_xiaoqian@sina.com, liuxin_0117@yeah.net, mengxf@nscc-tj.gov.cn, fengjh@nscc-tj.gov.cn

Abstract

In this study, we test and analyze the performance of the Gyrokinetic Toroidal Code (GTC). Guided by the analysis, we port GTC's compute-intensive subroutines to the GPU and accelerate them on the CPU+GPU heterogeneous architecture of the TH-1A supercomputer. Several optimization strategies are developed in the process: subroutines are integrated to reduce data transfer between host and device, GPU memory accesses are optimized to reduce access latency, and the static keyword is placed before array declarations to avoid unnecessary address allocation and data copies. Experimental results show that the subroutines ported to the GPU run between 6 and 8 times faster, and the total performance of GTC improves by a factor of 3 to 4.

Keywords: TH-1A, GPU, high performance computing, GTC, nuclear fusion

I. INTRODUCTION

In recent years, high performance computing (HPC) has become a prominent topic, and supercomputing power is widely regarded as a benchmark of a country's capacity for scientific innovation. Driven by the growing demand for computing power, and by the development barriers of the single-core processor, multicore technology has advanced greatly. It comes in two forms: homogeneous multicore and heterogeneous multicore.
A heterogeneous multicore design pairs a main core with coprocessors: the main core handles complex control logic, while the coprocessor handles intensive computation. This division of labor has become an important way to increase the computing power of computers. The CPU+GPU heterogeneous system is a typical example of this architecture, with the CPU acting as the main core and the GPU as the coprocessor. GPU computing power has grown rapidly in recent years: NVIDIA's Tesla M2050 GPU, based on the Fermi architecture, contains 448 CUDA cores and delivers more than 1 TFlops of single-precision performance and about 600 GFlops in double precision. The TH-1A supercomputer, built by the National University of Defense Technology, uses this CPU+GPU hybrid heterogeneous architecture and ranked No. 1 on the TOP500 list of world supercomputers in November 2010 [1].

Nuclear fusion energy is regarded as one of the most promising ways to solve humanity's energy and environmental challenges. GTC is an important program in fusion energy research; it is one of the few simulation codes that can fully exploit a supercomputer's computing power, and it has been selected by the United States Department of Energy as a benchmark for evaluating supercomputer performance. Its importance also means that it should run fast. GTC has been ported to most parallel computers in pursuit of high efficiency, but its performance still needs improvement. In this study we run GTC on the TH-1A supercomputer and, to improve the whole application's performance, accelerate its compute-intensive subroutines with GPUs on TH-1A's CPU+GPU hybrid heterogeneous architecture.
In this process we developed several optimization strategies: subroutines are integrated to reduce data transfer between host and device, GPU memory accesses are optimized to reduce memory access latency, and the static keyword is placed before array declarations to avoid unnecessary address allocation and data copies. Experimental results show that the subroutines ported to the GPU run about 6 to 8 times faster, and the total performance of GTC improves by a factor of 3 to 4.

The rest of this paper is organized as follows. Section 2 introduces the system architecture of TH-1A and the background of GTC. Section 3 profiles GTC on TH-1A and uses the analysis to locate the performance bottleneck and decide which subroutines should be accelerated on the GPU. Section 4 describes our optimization techniques in detail. Section 5 presents GTC's performance improvement after optimization. The last section draws conclusions and proposes future work.

II. BACKGROUND

A. System Architecture of TH-1A

Figure 1 shows the system architecture of TH-1A. TH-1A consists of a computing system, an interconnect communication system, a monitoring and diagnostic system, and an I/O system. The computing system comprises 1024 service nodes and 7168 compute nodes. Each service node is configured with two eight-core FT-1000 CPUs. The compute nodes adopt the main-core-plus-coprocessor design: each is configured with two six-core Intel Xeon CPUs (2.93 GHz) and one NVIDIA M2050 GPU. The M2050 is based on the Fermi architecture and contains fourteen streaming multiprocessors, each with 32 CUDA cores running at 1.15 GHz. The peak performance of TH-1A is 4700 TFlops, and its Linpack result is 2566 TFlops [2].

Figure 1 System architecture of TH-1A supercomputer.

B. GTC Program

The Gyrokinetic Toroidal Code is a time-dependent, three-dimensional particle-in-cell (PIC) program used to study turbulent transport in fusion plasmas. The PIC method describes the complex interaction between the field and the plasma: as the charged particles of the plasma move through a strong magnetic field, they set up an electrostatic field, and the particles are driven by spatial gradients in the equilibrium profiles of the plasma temperature and density [3-4]. The Poisson equation must then be solved to determine the trajectories of the charged particles [5]. The main steps of the PIC algorithm are as follows. First, the charge of each particle is deposited on its nearest grid points. Then the electrostatic potential and field are computed at each grid point. Next, the force acting on each particle is calculated from the field at its nearest grid points. Finally, the particles are moved according to the forces just calculated, and these steps repeat until the end of the simulation [6].

III. PERFORMANCE ANALYSIS OF GTC ON TH-1A

We profiled GTC on the TH-1A supercomputer to obtain the execution time of each module. The breakdown is shown in Figure 2(a): the electron module takes up about 80% of the whole processing time and is therefore the bottleneck of GTC's overall performance. The electron module consists of four subroutines: DPHIEFF, PUSHE, SHIFTE and CHARGEE. Their shares are shown in Figure 2(b): DPHIEFF and CHARGEE together account for only about 1% of the electron module's time, while PUSHE and SHIFTE account for about 99%.

(a) Time proportion of modules (b) Time proportion of subroutines
Figure 2 Execution time proportion of GTC.
PUSHE and SHIFTE are compute-intensive subroutines, so to improve GTC's total performance we can accelerate them on the GPU. It is worth noting that PUSHE takes up most of the electron module's execution time, so it is the first candidate for the GPU. However, PUSHE and SHIFTE are called by turns in GTC's main function, and there is some data dependence between the two subroutines; porting only PUSHE to the GPU would therefore cause a great deal of data transfer between host and device. To avoid this, we port SHIFTE to the GPU as well.

IV. OPTIMIZATION STRATEGIES

A. Subroutine Integration

Data processing on a CPU+GPU heterogeneous system proceeds in three steps: (1) the initial data to be processed is transferred from host to device; (2) the GPU processes the data in device memory; (3) the CPU transfers the processed result from device memory back to host memory. Host memory and device memory are connected by the PCI-E bus, which provides about 8 GB/s of effective bandwidth [7], so data transfer between host and device across the PCI-E bus often becomes the bottleneck of an application's performance. In accelerating PUSHE and SHIFTE we therefore integrate the two subroutines, following the principle that arrays holding large amounts of data should stay in device memory as long as possible; this largely avoids transferring big arrays between host and device. As Figure 3(a) shows, PUSHE and SHIFTE are called by turns inside a do-loop in GTC. When the GPU computes these two subroutines, the initial data and the processed results must be transferred between host and device across the PCI-E bus. Figure 3(b) shows the framework in which the two subroutines are ported to the GPU separately.
In that arrangement, each call of a subroutine must transfer its initial data and results between host and device, which limits the application's overall performance. We therefore adopt the subroutine integration strategy shown in Figure 3(c): PUSHE and SHIFTE are merged into a single function, pushe_shifte_func. The data copies needed for the computation are placed outside the loop body, and the computation of the whole loop runs on the GPU. In this way the number of data copies is reduced from 8*ncycle*irk to two. It is worth noting that PUSHE contains some MPI communication that must be performed on the host, so the arrays needed for the MPI operations must still be transferred from device memory to host memory. Because this communication sits inside an if statement, we can place the data copy inside the if statement as well; the copy then occurs only when the condition is true. Since the condition is rarely true, this copy has almost no effect on GTC's total performance.

Figure 3 Subroutines integrated optimization.

B. Memory Access Optimization

The device has several memory spaces: global memory, texture memory, shared memory, cache and registers. These spaces provide the basis for optimizing variables with different access characteristics [7]. We adopt two strategies to reduce memory access latency in the GPU computation of PUSHE and SHIFTE: making threads access global memory in a coalesced way, and binding certain arrays to texture memory. The former provides higher bandwidth; the latter reduces latency when the data accessed by adjacent threads lies in the same spatial region. The access patterns and optimization methods for the arrays in PUSHE and SHIFTE are shown in Table I. The arrays wzgc and wpgc0 already meet the condition for coalesced global-memory access, and since they are accessed only a few times there is no need to put them in shared memory. For arrays such as rgpsi and mtheta, the access patterns are too irregular to optimize.

TABLE I. MEMORY ACCESS OPTIMIZATION

Arrays                                   | Memory Access Characteristic | Optimization Method
gradphi, bsp, xsp                        | strong locality of access    | bind to texture memory
wbgc, wpgc, zelectron, zelectron0        | accesses not coalesced       | transpose to coalesce
wzgc, wpgc0, wtgc0, wtgc1, jtgc0, jtgc1  | accesses already coalesced   | none
rgpsi, mtheta, igrid, qtinv and so on    | irregular accesses           | none

C. static Keyword Optimization

The compiler allocates a fixed storage area for a variable declared with the static keyword, and the variable persists throughout the program's execution. We define a class named gpu_array, shown below. Arrays are declared and initialized through the constructors of gpu_array, and for some arrays we can place the static keyword before the declaration to avoid unnecessary address allocation and data copies from host to device.

    template <class T>
    class gpu_array {
    public:
        gpu_array(size_t size) {
            size_ = size;
            cutilSafeCall(cudaMalloc((void**)&ptr_, size * sizeof(T)));
        }
        gpu_array(T* data, size_t size) {
            size_ = size;
            cutilSafeCall(cudaMalloc((void**)&ptr_, size * sizeof(T)));
            cutilSafeCall(cudaMemcpy(ptr_, data, size * sizeof(T),
                                     cudaMemcpyHostToDevice));
        }
    private:
        T* ptr_;
        size_t size_;
    };

The function pushe_shifte_func is called repeatedly in the loop body of GTC's main function, so we use static declarations for some arrays to save address allocation and data copies. The arrays used in the PUSHE routine, both global and local, fall into the following kinds.

Local arrays: arrays that are defined, initialized and used entirely inside PUSHE, for example jtgc0, jtgc1, delt, delp, vdrtmp, wzgc, wpgc0, wtgc1, wpgc, wbgc and so on. Their sizes never change and their values need not be copied from host to device, so we place the static keyword before their declarations, which saves the array address allocation time:

    static gpu_array<float> d_array(arraysize);

Global arrays: the arrays needed by the computation of PUSHE and SHIFTE are passed into pushe_shifte_func as parameters. They can be divided further into two kinds according to how they change during the program's execution.

(1) Arrays whose values are set at the beginning of GTC's execution and never change afterwards. Such arrays need not be copied from host to device after the first call of pushe_shifte_func, so we place the static keyword before their declarations to save both the address allocation time and the host-to-device copy time:

    static gpu_array<float> d_array(h_array, arraysize);

(2) Global arrays whose values change often during GTC's execution. These arrays in device memory must be updated every time pushe_shifte_func is called, so we do not use the static keyword; each call then refreshes the device copies to the latest values:

    gpu_array<float> d_array(h_array, arraysize);
V. EXPERIMENTS

A. Input Files

We use three input files to test the performance of GTC on the TH-1A supercomputer: gtc_63.in, gtc_125.in and gtc_250.in. The number of grid points defined by each input file differs and is determined by the variables mpsi, mthetamax and r0; the three parameters of each input file are listed in Table II. We set the number of MPI processes to 32, 64 and 128, with one process per GPU.

TABLE II. PARAMETERS OF INPUT FILES

            | mpsi | mthetamax | r0
gtc_63.in   |      |           |
gtc_125.in  |      |           |
gtc_250.in  |      |           |

B. Experimental Results

1. gtc_63.in

The results for the gtc_63.in input file are shown in Figure 4. The electron label includes the computing time of subroutines PUSHE and SHIFTE; electron-gpu denotes the new time of the electron module with PUSHE and SHIFTE ported to the GPU; total denotes the total execution time of GTC with PUSHE and SHIFTE computed on the CPU; and total-new denotes the new total execution time with PUSHE and SHIFTE ported to the GPU. The results show a clear improvement once PUSHE and SHIFTE are accelerated on the GPU: the two subroutines run about 5 times faster, and GTC as a whole about 3 times faster.

Figure 4 Test of gtc_63.in.

2. gtc_125.in and gtc_250.in

The results for gtc_125.in and gtc_250.in are shown in Figure 5 and Figure 6. PUSHE and SHIFTE run about five to eight times faster, and the total performance of GTC improves by about three to four times.

Figure 5 Test of gtc_125.in.

Figure 6 Test of gtc_250.in.
VI. CONCLUSIONS AND FUTURE WORK

In this study we tested and analyzed the performance of the Gyrokinetic Toroidal Code (GTC). Guided by the analysis results, we ported GTC's compute-intensive subroutines to the GPU and accelerated them on the CPU+GPU heterogeneous architecture of TH-1A, developing several optimization strategies along the way to improve GTC's total performance. Experimental results show that PUSHE and SHIFTE can be accelerated by a factor of 5 to 8, and the total performance of GTC by a factor of 3 to 4. Our future work has two parts: accelerating other compute-intensive routines of GTC with the GPU, and developing CPU+GPU heterogeneous parallelism with an MPI+OpenMP+CUDA approach to make full use of TH-1A's computing power.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China under grants No. and No. . We thank Professors Zhihong Lin, Yi'an Lei and Yong Xiao of Peking University, who helped us solve many problems when porting GTC to the TH-1A supercomputer, and Peng Wang of NVIDIA Corporation, who gave us much good advice on optimizing GTC on the GPU.

REFERENCES

[1]
[2]
[3] S. Ethier, W. M. Tang and Z. Lin, "Gyrokinetic particle-in-cell simulations of plasma microturbulence on advanced computing platforms," Journal of Physics: Conference Series.
[4] Z. Lin, T. S. Hahm, W. W. Lee, et al., "Turbulent transport reduction by zonal flows: massively parallel simulations," Science, 1998.
[5] Z. Lin, T. S. Hahm, W. W. Lee, et al., "Method for solving the gyrokinetic Poisson equation in general geometry," Physical Review E, 1995.
[6] S. Klasky, S. Ethier, Z. Lin, et al., "Grid-based parallel data streaming implemented for the Gyrokinetic Toroidal Code," Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, 2003.
[7] NVIDIA Corporation, NVIDIA CUDA Programming Guide.
More informationGPU Architecture. Alan Gray EPCC The University of Edinburgh
GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From
More informationDistributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca
Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationTwo-Phase flows on massively parallel multi-gpu clusters
Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous
More informationBuilding NVLink for Developers
Building NVLink for Developers Unleashing programmatic, architectural and performance capabilities for accelerated computing Why NVLink TM? Simpler, Better and Faster Simplified Programming No specialized
More informationUsing Graphics Chips for General Purpose Computation
White Paper Using Graphics Chips for General Purpose Computation Document Version 0.1 May 12, 2010 442 Northlake Blvd. Altamonte Springs, FL 32701 (407) 262-7100 TABLE OF CONTENTS 1. INTRODUCTION....1
More informationThe GPU-based Parallel Calculation of Gravity and Magnetic Anomalies for 3D Arbitrary Bodies
Available online at www.sciencedirect.com Procedia Environmental Sciences 12 (212 ) 628 633 211 International Conference on Environmental Science and Engineering (ICESE 211) The GPU-based Parallel Calculation
More informationLarge scale Imaging on Current Many- Core Platforms
Large scale Imaging on Current Many- Core Platforms SIAM Conf. on Imaging Science 2012 May 20, 2012 Dr. Harald Köstler Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen,
More informationLecture Topic Projects 1 Intro, schedule, and logistics 2 Applications of visual analytics, basic tasks, data types 3 Introduction to D3, basic vis
Lecture Topic Projects 1 Intro, schedule, and logistics 2 Applications of visual analytics, basic tasks, data types 3 Introduction to D3, basic vis techniques for non-spatial data Project #1 out 4 Data
More informationHigh Performance Computing with Accelerators
High Performance Computing with Accelerators Volodymyr Kindratenko Innovative Systems Laboratory @ NCSA Institute for Advanced Computing Applications and Technologies (IACAT) National Center for Supercomputing
More informationTitan - Early Experience with the Titan System at Oak Ridge National Laboratory
Office of Science Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Buddy Bland Project Director Oak Ridge Leadership Computing Facility November 13, 2012 ORNL s Titan Hybrid
More informationA MPI-based parallel pyramid building algorithm for large-scale RS image
A MPI-based parallel pyramid building algorithm for large-scale RS image Gaojin He, Wei Xiong, Luo Chen, Qiuyun Wu, Ning Jing College of Electronic and Engineering, National University of Defense Technology,
More informationFujitsu s Approach to Application Centric Petascale Computing
Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview
More informationTo Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs
To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com
More informationPerformance Diagnosis for Hybrid CPU/GPU Environments
Performance Diagnosis for Hybrid CPU/GPU Environments Michael M. Smith and Karen L. Karavanic Computer Science Department Portland State University Performance Diagnosis for Hybrid CPU/GPU Environments
More informationAn Extension of the StarSs Programming Model for Platforms with Multiple GPUs
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento
More informationHARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA
HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh
More informationACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS
ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation
More informationTechnology for a better society. hetcomp.com
Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory
More informationRoberts edge detection algorithm based on GPU
Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2014, 6(7):1308-1314 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 Roberts edge detection algorithm based on GPU
More informationAccelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin
Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan Abstract String matching is the most
More informationPlasma Particle-in-Cell Codes on GPUs: A Developer s Perspective. Viktor K. Decyk and Tajendra V. Singh UCLA
Plasma Particle-in-Cell Codes on GPUs: A Developer s Perspective Viktor K. Decyk and Tajendra V. Singh UCLA Abstract Particle-in-Cell (PIC) codes are one of the most important codes used in plasma physics
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and
More informationDuksu Kim. Professional Experience Senior researcher, KISTI High performance visualization
Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior
More informationPerformance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA
Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA Kazuhiko Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi Tohoku University 14 November,
More informationFlux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters
Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,
More informationLecture 1: Introduction and Computational Thinking
PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational
More informationGPU Programming. Ringberg Theorie Seminar 2010
or How to tremendously accelerate your code? Michael Kraus, Christian Konz Max-Planck-Institut für Plasmaphysik, Garching Ringberg Theorie Seminar 2010 Introduction? GPU? GPUs can do more than just render
More informationMAGMA. Matrix Algebra on GPU and Multicore Architectures
MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/
More informationRunning HARMONIE on Xeon Phi Coprocessors
Running HARMONIE on Xeon Phi Coprocessors Enda O Brien Irish Centre for High-End Computing Disclosure Intel is funding ICHEC to port & optimize some applications, including HARMONIE, to Xeon Phi coprocessors.
More informationPerformance Benefits of NVIDIA GPUs for LS-DYNA
Performance Benefits of NVIDIA GPUs for LS-DYNA Mr. Stan Posey and Dr. Srinivas Kodiyalam NVIDIA Corporation, Santa Clara, CA, USA Summary: This work examines the performance characteristics of LS-DYNA
More informationarxiv: v1 [hep-lat] 12 Nov 2013
Lattice Simulations using OpenACC compilers arxiv:13112719v1 [hep-lat] 12 Nov 2013 Indian Association for the Cultivation of Science, Kolkata E-mail: tppm@iacsresin OpenACC compilers allow one to use Graphics
More informationRadiation Modeling Using the Uintah Heterogeneous CPU/GPU Runtime System
Radiation Modeling Using the Uintah Heterogeneous CPU/GPU Runtime System Alan Humphrey, Qingyu Meng, Martin Berzins, Todd Harman Scientific Computing and Imaging Institute & University of Utah I. Uintah
More informationGame-changing Extreme GPU computing with The Dell PowerEdge C4130
Game-changing Extreme GPU computing with The Dell PowerEdge C4130 A Dell Technical White Paper This white paper describes the system architecture and performance characterization of the PowerEdge C4130.
More informationOptimization solutions for the segmented sum algorithmic function
Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationIntroduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29
Introduction CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction Spring 2018 1 / 29 Outline 1 Preface Course Details Course Requirements 2 Background Definitions
More informationWhen MPPDB Meets GPU:
When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU
More informationIntroduc)on to GPU Programming
Introduc)on to GPU Programming Mubashir Adnan Qureshi h3p://www.ncsa.illinois.edu/people/kindr/projects/hpca/files/singapore_p1.pdf h3p://developer.download.nvidia.com/cuda/training/nvidia_gpu_compu)ng_webinars_cuda_memory_op)miza)on.pdf
More informationImproving performances of an embedded RDBMS with a hybrid CPU/GPU processing engine
Improving performances of an embedded RDBMS with a hybrid CPU/GPU processing engine Samuel Cremer 1,2, Michel Bagein 1, Saïd Mahmoudi 1, Pierre Manneback 1 1 UMONS, University of Mons Computer Science
More informationCUDA Memories. Introduction 5/4/11
5/4/11 CUDA Memories James Gain, Michelle Kuttel, Sebastian Wyngaard, Simon Perkins and Jason Brownbridge { jgain mkuttel sperkins jbrownbr}@cs.uct.ac.za swyngaard@csir.co.za 3-6 May 2011 Introduction
More informationThe Stampede is Coming: A New Petascale Resource for the Open Science Community
The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation
More informationLecture Topic Projects
Lecture Topic Projects 1 Intro, schedule, and logistics 2 Applications of visual analytics, data types 3 Data sources and preparation Project 1 out 4 Data reduction, similarity & distance, data augmentation
More informationCUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN
CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction Francesco Rossi University of Bologna and INFN * Using this terminology since you ve already heard of SIMD and SPMD at this school
More information