Scaling Fast Multipole Methods up to 4000 GPUs

Rio Yokota, King Abdullah University of Science and Technology, 4700 KAUST, Thuwal, Saudi Arabia
Lorena Barba, Boston University, 110 Cummington St., Boston, MA 02215, USA
Tetsu Narumi, University of Electro-Communications, Chofugaoka, Chofu, Tokyo, Japan
Kenji Yasuoka, Keio University, Hiyoshi, Yokohama, Japan

ABSTRACT
The Fast Multipole Method (FMM) is a hierarchical N-body algorithm with linear complexity, high arithmetic intensity, high data locality, hierarchical communication patterns, and no global synchronization. The combination of these features allows the FMM to scale well on large GPU-based systems and to use their compute capability effectively. We present a 1 PFlop/s calculation of isotropic turbulence with 64 billion vortex particles using 4096 GPUs on the TSUBAME 2.0 system.

Categories and Subject Descriptors
D.1.3 [Concurrent Programming]: Distributed programming (global tree partitioning, hierarchical communication); Parallel programming (MPI, OpenMP, CUDA)
G.1.2 [Approximation]: Special function approximations (fast N-body approximation, fast multipole methods)

General Terms
Algorithms, Performance, Verification

Keywords
Fast Multipole Methods; GPUs; Scalability

1. INTRODUCTION
N-body algorithms are a natural way to simulate particle-based physics, as can be seen in astrophysics [1] and molecular dynamics [2]. Another common application of N-body methods is the solution of boundary integral problems that arise in acoustics [3] and electromagnetics [4]. The governing equations of these problems can be broadly categorized as elliptic PDEs, which are non-local in nature and require information to be propagated throughout the entire domain at every time step. Calculating the effect of all particles on all particles is one way to propagate the information globally. There are other ways to propagate information globally, such as solving linear systems or dealing with the PDE in Fourier space. It could be interesting to compare N-body methods with these other approaches, and to study their relative performance on next-generation hardware.

Hierarchical N-body algorithms such as Fast Multipole Methods (FMM) have linear complexity, but retain the arithmetic intensity of brute-force N-body methods within their inner kernels. This arithmetic intensity allows FMMs to extract the full potential of the GPU's compute capability. Gumerov and Duraiswami [5] were the first to implement the FMM on GPUs. To compensate for the lack of support for complex arithmetic on GPUs, they reformulated their basis functions using real spherical harmonics. They have recently updated their code to handle multiple GPUs and to utilize both the CPU and the GPU [6]. The M2M, M2L, and L2L operations were offloaded to the CPU, while the P2M, L2P, and P2P operations were done on the GPU. This decision was based on their observation that the M2M, M2L, and L2L kernels could be calculated faster on the CPU. Contrary to these observations, Takahashi et al. [7] developed a technique to maximize the performance of M2L operations on the GPU, and achieved up to 270 GFlop/s on a Tesla C1060 for the M2L kernel.

Treecodes have also been implemented efficiently on GPUs. Stock and Gharakhani [8] calculated the Biot-Savart kernel using a treecode on GPUs. Gaburov et al. [9] implemented a Laplace-kernel treecode on GPUs, and achieved 100 GFlop/s performance and 50 GB/s transfer rates. The same authors have recently updated their code to run entirely on GPUs [1].
Burtscher and Pingali [10] present yet another implementation of the treecode entirely on GPUs, where they used locks and atomic operations to reduce synchronization during the tree construction.

Multi-GPU implementations of treecodes and FMMs have been the focus of research for the past few years. Yokota et al. calculated the Biot-Savart kernel on 64 GPUs, and achieved a cost performance of $9.4/GFlop/s [10]. This was combined with the work of Hamada and Nitadori, which calculated 1.6 billion particles in 17 seconds, achieving 42 TFlop/s sustained performance on 256 GPUs and $8.0/GFlop/s [11].

Around the same time, Lashuk et al. also ran a 256-GPU calculation with the FMM, and were able to calculate 256 million points in 2.2 seconds, resulting in 8 TFlop/s of sustained performance [12]. The following year, the former work was extended to achieve 190 TFlop/s on 576 GPUs [13], while the latter work was extended to achieve 0.7 PFlop/s on 200,000 AMD cores [14]. Jetley et al. [15] implemented a treecode on multiple GPUs using the Charm++ framework, and sustained an average of 3.82 TFlop/s on 896 CPU cores + 256 GPUs.

In this work, we present a further milestone for the multi-GPU FMM, which achieves over 1 PFlop/s on 4096 GPUs of the TSUBAME 2.0 system. We present novel techniques such as the dual tree traversal, hybridization of treecode and FMM, and auto-tuning on heterogeneous systems.

2. FAST MULTIPOLE METHOD
2.1 Series Expansions
An important factor in the optimization of FMMs on heterogeneous systems is the type of expansion that is used for the multipole and local expansions. Most treecodes use simple Cartesian Taylor expansions [16], while the selection of expansions in FMMs varies from spherical harmonics [17] and plane waves [17] to Chebyshev polynomials [18] and equivalent charges on a sphere [19] or cube [20]. The asymptotic behavior with respect to the order of truncation p differs among these types of expansions, as shown in Table 1. Note that the asymptotic behavior means very little without consideration of the asymptotic constant. Methods with better asymptotics tend to have larger constants, and therefore can be slower for low-accuracy calculations. For example, gravitational N-body simulations are typically performed with p <= 3, and for this case Cartesian expansions are the optimal choice. The common misperception that treecodes are faster than FMMs for low-accuracy calculations is actually the result of the difference in the type of expansions. There is also some difference in the way treecodes form more efficient interaction lists by using the multipole acceptance criterion, but this makes a big difference only when p is small and the number of particles per leaf cell becomes small.

Table 1. Asymptotic behavior of different types of expansions

  Expansion                         Work           Storage
  Cartesian [16]                    O(p^6)         O(p^3)
  Spherical [17]                    O(p^4)         O(p^2)
  Spherical+rotation [17]           O(p^3)         O(p^2)
  Plane-wave [17]                   O(p^3)         O(p^2)
  Chebyshev [18]                    O(p^6)         O(p^3)
  Equivalent charge (sphere) [19]   O(p^4)         O(p^2)
  Equivalent charge (cube) [20]     O(p^3 log p)   O(p^2)
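To make the storage column of Table 1 concrete, the short program below (an illustration, not code from this work) counts the expansion coefficients stored per cell for a Cartesian Taylor expansion and for a spherical harmonics expansion at the same truncation order p; the counts (p+1)(p+2)(p+3)/6 and (p+1)^2 are the standard term counts for these two expansions. It shows why the better asymptotic behavior only pays off once p is moderately large, which is the point made above about asymptotic constants.

    #include <cstdio>

    // Number of Cartesian Taylor coefficients up to total order p in 3-D:
    // all monomials x^i y^j z^k with i + j + k <= p, i.e. (p+1)(p+2)(p+3)/6.
    int cartesianTerms(int p) { return (p + 1) * (p + 2) * (p + 3) / 6; }

    // Number of spherical harmonics coefficients up to degree p: (p+1)^2.
    int sphericalTerms(int p) { return (p + 1) * (p + 1); }

    int main() {
      std::printf(" p  Cartesian  Spherical\n");
      for (int p = 1; p <= 15; p++)
        std::printf("%2d  %9d  %9d\n", p, cartesianTerms(p), sphericalTerms(p));
      // p = 3 : 20 vs. 16 coefficients, nearly identical, so the simpler
      //         Cartesian kernels win for low-accuracy gravity calculations.
      // p = 10: 286 vs. 121, where the O(p^2) storage of spherical harmonics wins.
      return 0;
    }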
The strategy for implementing these different types of expansions on GPUs is quite different. For some of these expansions, precomputation of the translation matrices is effective, whereas in other cases it is faster to recompute everything inside the GPU kernel and reduce the amount of data being transferred to the device. So far we have experimented with GPU implementations of Cartesian, spherical [21], spherical+rotation, and equivalent charge (sphere) [10] expansions. For the Cartesian and spherical expansions, computing all translation matrix entries on-the-fly gave the best performance, whereas the Wigner rotation matrices for the spherical harmonics with rotation and the inverse matrices for the equivalent charges were faster if they were pre-calculated on the GPU and stored in global memory. Note that for a cubic octree, the non-adaptive M2L translation stencil has 7^3 - 3^3 = 316 possible relative positions of the cells. Therefore, all translation matrices, Wigner rotation matrices, and inverse matrices of the equivalent charges can be pre-computed and stored on the GPU.

2.2 Auto-tuning
The majority of the calculation time in FMMs is spent on the M2L (multipole-to-local) and P2P (particle-to-particle) kernels shown in Figure 1.

Figure 1. Flow of the FMM calculation (P2M, M2M, M2L, L2L, L2P, P2P); information moves from the source particles to the target particles.

The P2P kernel is the same as a brute-force N-body calculation, except that each target particle interacts only with neighboring source particles. The P2P kernel does not depend on the type of expansion or the order of truncation. However, it is indirectly influenced by the change in the optimum number of particles per leaf cell, which does depend on the type of expansion and the order of truncation. The M2L kernel is directly influenced by both the type of expansion and the order of truncation. The M2L kernel usually has lower arithmetic intensity than the P2P kernel, so it does not accelerate as much on the GPU. This in turn shifts the balance between the calculation times of these two kernels, which is dealt with by changing the number of particles per leaf cell, or by calculating the M2L kernel on the CPU [6]. Determining what the number of particles per cell should be for a given architecture, or deciding which device to execute the M2L kernel on, is a nontrivial matter. We have developed an auto-tuning mechanism to automatically determine the optimum number of particles per cell for any given architecture [22]. In our approach, we time the M2L and P2P kernels on the device that we intend to run on, and then use these timing results to select whether to calculate the M2L or the P2P kernel for a given pair of cells. As long as the tree structure is deep enough, this method will switch to P2P kernels at the optimum level and terminate the tree traversal.
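The calibration step can be illustrated with the following sketch, which is not the code used in this work: it times two stand-in workloads (placeholders for launching the actual P2P and M2L GPU kernels on representative cells) and reduces the measurements to a cost per particle pair and per cell pair. These two numbers are what the traversal later compares when deciding how to treat a given pair of cells.

    #include <chrono>
    #include <cstdio>

    // Measure the average runtime of a kernel by calling it `reps` times.
    // `kernel` is any callable; in the real code it would launch the P2P or
    // M2L GPU kernel on a representative pair of cells.
    template <class F>
    double timeKernel(F kernel, int reps = 100) {
      using clock = std::chrono::steady_clock;
      auto t0 = clock::now();
      for (int i = 0; i < reps; i++) kernel();
      auto t1 = clock::now();
      return std::chrono::duration<double>(t1 - t0).count() / reps;
    }

    int main() {
      const int particlesPerLeaf = 500;  // candidate leaf size to calibrate

      // Stand-in workloads (placeholders, not the actual FMM kernels).
      volatile double sink = 0;
      auto fakeP2P = [&] {
        for (int i = 0; i < particlesPerLeaf * particlesPerLeaf; i++) sink += 1.0 / (i + 1);
      };
      auto fakeM2L = [&] {
        for (int i = 0; i < 10000; i++) sink += 1.0 / (i + 1);
      };

      double p2pPerPair = timeKernel(fakeP2P) / (double(particlesPerLeaf) * particlesPerLeaf);
      double m2lPerCellPair = timeKernel(fakeM2L);

      std::printf("P2P cost per particle pair : %.3e s\n", p2pPerPair);
      std::printf("M2L cost per cell pair     : %.3e s\n", m2lPerCellPair);
      // These two numbers are later compared during the tree traversal to decide,
      // for each pair of cells, whether P2P or M2L is the cheaper option.
      return 0;
    }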

This concept may seem quite natural for those in the treecode community, but may be difficult to grasp for those using the FMM. This is because it is uncommon in FMMs to actually traverse the tree structure, and without the concept of tree traversal this auto-tuning mechanism will not function. A conventional FMM loops over all target cells and tries to explicitly form a well-separated list of source cells. Excluding the well-separated source cells at the parent level relies on a neighbor search via Morton indexing, which makes it difficult to exclude the cells that were handled by the P2P kernel. Therefore, if we tried to implement our auto-tuning scheme in the standard FMM framework, we would have to devise a method to keep track of which cells were handled by P2P for every single target cell in the tree, and exclude them from the M2L well-separated list.

The dual tree traversal method introduced by Warren and Salmon [23] and refined by Dehnen [16] uses the concept of tree traversal but achieves linear complexity, which makes it an FMM. The general idea is to traverse two trees simultaneously, one for the target and one for the source. Starting from a pair of root cells, the well-separatedness is examined per pair, and the larger (or equal) cell is subdivided until the pair is either well-separated enough to perform M2L or both cells become leaves, at which point the P2P kernel is calculated. This is a very generic and flexible approach, which eliminates the need to calculate explicit well-separated lists of cells in the FMM. It also turns out to be a convenient framework for implementing our auto-tuning mechanism: we can simply change the condition to calculate the P2P kernel from "if both cells are leaves" to "if it is faster to do so" (a sketch is given at the end of this subsection).

The effect of our auto-tuning mechanism on a single GPU is shown in Figure 2. Each data point is for a different number of particles between 100 thousand and 10 million.

Figure 2. FMM runtime for different problem sizes N (manual and auto-tuned runs with 500 and 100 particles per leaf cell).

In the legend, "manual" refers to the FMM without auto-tuning and "auto" to the one with auto-tuning, while "ppl" stands for particles per leaf cell. The optimum value for this case is 500 ppl. If we artificially make the tree structure deeper than it should be, the M2L kernel becomes disproportionately large for most N. This can be observed in the results for the manual case with 100 ppl. However, when the auto-tuning capability is introduced, the tree traversal is terminated at the optimum level and the results match those of the 500 ppl case. This shows that our auto-tuning mechanism will find the correct type of kernel to use for any given architecture.
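The following is a minimal serial sketch of the dual tree traversal with the auto-tuned termination condition (the sketch referred to above). It is not the traversal used in this work: the cell layout, the multipole acceptance test, and the kernel stubs are simplified placeholders. It only illustrates how the calibrated per-interaction costs replace the usual "calculate P2P only when both cells are leaves" rule.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Cell {
      std::size_t numBodies = 0;
      double radius = 0;               // half edge length of the cell
      double x[3] = {0, 0, 0};         // center of the cell
      std::vector<Cell*> children;     // empty for a leaf
      bool isLeaf() const { return children.empty(); }
    };

    // Per-interaction costs measured by the calibration step (seconds).
    struct KernelCosts { double p2pPerPair, m2lPerCellPair; };

    // Stubs standing in for the actual GPU kernel launches.
    void p2p(Cell&, Cell&) { /* launch P2P kernel for this cell pair */ }
    void m2l(Cell&, Cell&) { /* launch M2L translation for this cell pair */ }

    // Multipole acceptance criterion: the pair is well separated when the
    // cell sizes are small compared with the distance between their centers.
    bool wellSeparated(const Cell &a, const Cell &b, double theta = 0.5) {
      double dx = a.x[0] - b.x[0], dy = a.x[1] - b.x[1], dz = a.x[2] - b.x[2];
      double r = std::sqrt(dx * dx + dy * dy + dz * dz);
      return (a.radius + b.radius) < theta * r;
    }

    void traverse(Cell &target, Cell &source, const KernelCosts &cost) {
      if (wellSeparated(target, source)) {
        // Auto-tuning: use the calibrated timings to pick the cheaper kernel,
        // instead of always applying M2L to well-separated pairs.
        double p2pCost = cost.p2pPerPair * target.numBodies * source.numBodies;
        if (p2pCost < cost.m2lPerCellPair) p2p(target, source);
        else m2l(target, source);
      } else if (target.isLeaf() && source.isLeaf()) {
        p2p(target, source);           // close pair of leaves: direct summation
      } else if (source.isLeaf() ||
                 (!target.isLeaf() && target.radius >= source.radius)) {
        for (Cell *child : target.children) traverse(*child, source, cost);
      } else {
        for (Cell *child : source.children) traverse(target, *child, cost);
      }
    }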
3. SCALABILITY RESULTS
3.1 Bioelectrostatics on DEGIMA
The first scalability test on a large GPU system uses the FMM in conjunction with a boundary element method (BEM) to accelerate a biomolecular electrostatics calculation [24]. The applications demonstrated include the electrostatics of protein-drug binding and several multi-million-atom systems consisting of hundreds to thousands of copies of lysozyme molecules. A representative boundary element mesh on a single lysozyme molecule is shown in Figure 3.

Figure 3. Boundary element mesh on a lysozyme molecule.

The parallel scalability of the software was studied on the DEGIMA cluster at the Nagasaki Advanced Computing Center. At the time of the runs, the DEGIMA system had 144 nodes and 288 NVIDIA GTX 295 cards, each with two GPUs, resulting in a total of 576 GPUs. There are 6 QDR InfiniBand switches connected with 4 QDR networks. The switches are connected to the nodes by SDR InfiniBand, and the total bisection bandwidth is 160 Gbps.

We performed a strong scalability study of the FMM on the DEGIMA system using up to 512 GPUs. The global problem size was N = 10^8 and the order of multipole expansions was set to p = 10. The type of expansion used was spherical harmonics with rotation-based translations. Figure 4 shows the breakdown of the FMM calculation time multiplied by the number of MPI processes; a constant value therefore means perfect strong scaling. The same scalability tests were run on TSUBAME 2.0, whose hardware specifications are described in the following subsection.

Figure 4. Strong scaling of the FMM for N = 10^8 on DEGIMA and TSUBAME 2.0 (time multiplied by the number of processes, broken down into tree construction, mpisendp2p, mpisendm2l, and the P2P, P2M, M2M, M2L, L2L, and L2P kernels).

In the legend, "tree construction" denotes the time spent sorting the Morton indices and preprocessing the FMM kernels, while "mpisendp2p" and "mpisendm2l" are the MPI communication times for sending the particles and multipoles, respectively. The other legend entries are self-explanatory. The kernel runtimes include the buffering and transfer of data to the GPU, though we have confirmed that 70% of the time was spent on the actual CUDA kernel [24]. As expected, the M2L and P2P kernels take up a large portion of the entire runtime. The scalability on DEGIMA is perfect up to 128 GPUs, but decays rapidly after that: at 256 GPUs the parallel efficiency is 78%, and at 512 GPUs it decreases to 48%. On the other hand, the results on TSUBAME 2.0 show a more gradual decrease in parallel efficiency: at 256 GPUs the parallel efficiency is 79% (almost the same as on DEGIMA), and at 512 GPUs it is 65%. The difference in the scalability of the two systems is more likely a latency issue than a bandwidth limitation, since this is a strong scaling test and, with the acceleration of GPUs, the calculation of 100 million points takes less than half a second.

We used a flat MPI parallelization with 4 processes per node on DEGIMA and 3 processes per node on TSUBAME 2.0, matching the number of GPUs on each node. On DEGIMA, the MPI communicator was split into an inter-node communicator and an intra-node communicator, and the alltoallv communication was performed in two stages. On TSUBAME 2.0, the native MPI_Alltoallv was faster than the two-stage version.
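A minimal sketch of the communicator splitting behind the two-stage alltoallv on DEGIMA is shown below, assuming ranks are numbered consecutively within each node and using a hypothetical RANKS_PER_NODE constant; it is not the implementation used in this work. Stage one exchanges data within a node so that each local rank holds everything destined for remote ranks with the same local index, and stage two exchanges across nodes within that "column".

    #include <mpi.h>

    // Assumed layout (not from the paper): MPI ranks are numbered
    // consecutively on each node, RANKS_PER_NODE ranks per node
    // (4 on DEGIMA, 3 on TSUBAME 2.0 in the runs described above).
    const int RANKS_PER_NODE = 4;

    // Build an intra-node communicator (all ranks on the same node) and an
    // inter-node communicator (all ranks with the same local index, one per
    // node). A two-stage alltoallv first routes each message, inside the node,
    // to the local rank whose index matches the destination's local index,
    // then delivers it across nodes inside that column.
    void splitCommunicators(MPI_Comm &intraNode, MPI_Comm &interNode) {
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      int node  = rank / RANKS_PER_NODE;  // which node this rank lives on
      int local = rank % RANKS_PER_NODE;  // index of this rank within its node
      MPI_Comm_split(MPI_COMM_WORLD, node,  local, &intraNode);
      MPI_Comm_split(MPI_COMM_WORLD, local, node,  &interNode);
    }

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      MPI_Comm intraNode, interNode;
      splitCommunicators(intraNode, interNode);
      // Stage 1: MPI_Alltoallv on intraNode (send buffers repacked by the
      //          destination's local index beforehand).
      // Stage 2: MPI_Alltoallv on interNode (buffers repacked by destination node).
      MPI_Comm_free(&intraNode);
      MPI_Comm_free(&interNode);
      MPI_Finalize();
      return 0;
    }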

The actual bioelectrostatics BEM run consisted of multiple lysozyme molecules, with each molecular surface discretized into 102,486 boundary element nodes. This results in a calculation with over 20 million atoms and over one billion unknowns. The FMM for this configuration took approximately one minute per BEM iteration on 512 GPUs, which yields a sustained performance of 34.6 TFlop/s.

3.2 Turbulence on TSUBAME 2.0
The second scalability test uses the FMM to calculate the interaction of vortex elements in a particle-based turbulence simulation [25]. The tests were run on the full TSUBAME 2.0 system at the Tokyo Institute of Technology. The TSUBAME 2.0 system has 1408 nodes, each with a 12-core Westmere-EP 2.93 GHz CPU configuration, 3 NVIDIA M2050 GPUs, 54 GB of RAM, and 120 GB of local SSD storage. The interconnect is a dual-rail QDR InfiniBand with 2 x 40 Gbps of bandwidth, and the bisection bandwidth of the entire system is over 200 Tbps. Most of TSUBAME 2.0's 2.4 PFlop/s peak performance comes from its 4224 GPUs: 512 GFlop/s per GPU, or 2.2 PFlop/s in total.

The test case is a decaying isotropic turbulence with an initial microscale Reynolds number of Re_λ = 500. The domain is [-π, π]^3 and is resolved by roughly 69 billion vortex particles. The FMM was extended to handle periodic boundary conditions by setting up 27 periodic images in each direction. The order of truncation in the FMM was set to p = 14 to capture the high frequencies of the kinetic energy spectrum.

Figure 5. Isosurface of the second invariant II of the velocity gradient tensor in isotropic turbulence.

The isosurface of the second invariant of the velocity gradient tensor is shown in Figure 5. Due to the limited computing time available on the full TSUBAME 2.0 machine, we were not able to simulate the isotropic turbulence to the point where coherent vortex structures could be observed. Nonetheless, the high fidelity of the vortex simulation can be seen in Figure 5.

The turbulence run was calculated on 4096 GPUs. We also performed a weak scaling test by scaling down the problem size along with the number of GPUs, so that the number of particles per process remained constant. The calculation time of one time step of the vortex particle simulation is shown against the number of processes in Figure 6; a constant value means perfect weak scaling.

Figure 6. Weak scaling of the FMM on TSUBAME 2.0 (time per step, broken down into P2P evaluation, FMM evaluation, MPI communication, GPU buffering, and tree construction).

In the legend, "P2P evaluation" denotes the time spent on the P2P GPU kernel, this time excluding the buffering and transfer of data. Similarly, "FMM evaluation" is the total GPU kernel time spent on all FMM kernels, excluding the buffering and transfer of data. "MPI communication" represents the total time spent on MPI communication of both particles and multipoles. Note, however, that the communication of particles is overlapped with the P2P evaluation of local particles. The overlapped time is subtracted from "P2P evaluation" so that the total height of the bar shows the actual wall-clock time. As a side effect, the P2P evaluation may seem to shrink for large numbers of processes, but it is actually the increasing overlap with communication time that causes this. "GPU buffering" is the time spent buffering and transferring data to the GPU. Unlike the strong scaling test in the previous subsection, the weak scaling tests require more storage than is available in GPU device memory, so multiple calls to the GPU have to be made; this increases the GPU buffering time. We hope to alleviate this problem by using double buffering and asynchronous memory transfers in the future. Finally, "tree construction" includes the binning of particles into cells and the linking of the tree structure, and most importantly the partitioning of the global tree structure. Although our implementation only updates the global tree structure and never reconstructs it, the migration of particles currently results in a significant amount of communication. We are investigating a more efficient way to handle the update of the global tree structure.

The parallel efficiency on 4096 GPUs was 74%, and the sustained performance of the turbulence calculation was 1.01 PFlop/s.
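The overlap of particle communication with the local P2P evaluation mentioned above can be sketched as follows. This is not the routine used in this work: the buffer layout is assumed to be already packed per neighbor rank, the receive buffers pre-sized, and localP2P()/remoteP2P() are hypothetical placeholders for the GPU kernel launches.

    #include <mpi.h>
    #include <cstddef>
    #include <vector>

    struct Particle { double x, y, z, q; };

    // Placeholders for the GPU kernel launches used in the actual code.
    void localP2P(std::vector<Particle> &local) { (void)local; }
    void remoteP2P(std::vector<Particle> &local,
                   const std::vector<Particle> &halo) { (void)local; (void)halo; }

    // Post non-blocking sends/receives of halo particles, evaluate the local
    // P2P interactions while the messages are in flight, then wait and add
    // the contributions of the received remote particles.
    void overlappedP2P(std::vector<Particle> &local,
                       const std::vector<int> &neighbors,
                       std::vector<std::vector<Particle>> &sendBufs,
                       std::vector<std::vector<Particle>> &recvBufs,  // pre-sized
                       MPI_Comm comm) {
      std::vector<MPI_Request> reqs(2 * neighbors.size());
      for (std::size_t i = 0; i < neighbors.size(); i++) {
        // Particle is plain data, so it is exchanged here as raw bytes.
        MPI_Irecv(recvBufs[i].data(), int(recvBufs[i].size() * sizeof(Particle)),
                  MPI_BYTE, neighbors[i], 0, comm, &reqs[2 * i]);
        MPI_Isend(sendBufs[i].data(), int(sendBufs[i].size() * sizeof(Particle)),
                  MPI_BYTE, neighbors[i], 0, comm, &reqs[2 * i + 1]);
      }
      localP2P(local);  // overlapped with the particle communication
      MPI_Waitall(int(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);
      for (const auto &halo : recvBufs) remoteP2P(local, halo);
    }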

4. ACKNOWLEDGMENTS
Computing time on the TSUBAME 2.0 system was made possible by the Grand Challenge Program of TSUBAME. LAB acknowledges partial support from NSF grant OCI, ONR award #N, and the Boston University College of Engineering.

5. REFERENCES
[1] Bedorf, J., Gaburov, E., and Portegies Zwart, S. A Sparse Octree Gravitational N-body Code that Runs Entirely on the GPU Processor. J. Comput. Phys. 231.
[2] Levine, B. G., Stone, J. E., and Kohlmeyer, A. Fast Analysis of Molecular Dynamics Trajectories with Graphics Processing Units - Radial Distribution Function Histogramming. J. Comput. Phys. 230.
[3] Wu, H., Liu, Y., and Jiang, W. Analytical Integration of the Moments in the Diagonal Form Fast Multipole Boundary Element Method for 3-D Acoustic Wave Problems. Eng. Anal. Bound. Elem. 36.
[4] Tsuji, P. and Ying, L. A Fast Directional Algorithm for High-frequency Electromagnetic Scattering. J. Comput. Phys. 230.
[5] Gumerov, N. A. and Duraiswami, R. Fast Multipole Methods on Graphics Processors. J. Comput. Phys. 227.
[6] Hu, Q., Gumerov, N. A., and Duraiswami, R. Scalable Fast Multipole Methods on Distributed Heterogeneous Architectures. SC'11 Proceedings of the Conference on High Performance Computing, Networking, Storage and Analysis.
[7] Takahashi, T., Cecka, C., Fong, W., and Darve, E. Optimizing the Multipole-to-local Operator in the Fast Multipole Method for Graphical Processing Units. Int. J. Numer. Meth. Eng. 89.
[8] Stock, M. J. and Gharakhani, A. Toward Efficient GPU-accelerated N-body Simulations. AIAA Paper.
[9] Gaburov, E., Bedorf, J., and Portegies Zwart, S. Gravitational Tree-code on Graphics Processing Units: Implementation in CUDA. Procedia Computer Science 1.
[10] Yokota, R., Narumi, T., Sakamaki, R., Kameoka, S., Obi, S., and Yasuoka, K. Fast Multipole Methods on a Cluster of GPUs for the Meshless Simulation of Turbulence. Comput. Phys. Comm. 180.
[11] Hamada, T., Yokota, R., Nitadori, K., Narumi, T., Yasuoka, K., and Taiji, M. TFlops Hierarchical N-body Simulations on GPUs with Applications in both Astrophysics and Turbulence. SC'09 Proceedings of the Conference on High Performance Computing, Networking, Storage and Analysis.
[12] Lashuk, I., Chandramowlishwaran, A., Langston, H., Nguyen, T. A., Sampath, R., Shringarpure, A., Vuduc, R., Ying, L., Zorin, D., and Biros, G. A Massively Parallel Adaptive Fast Multipole Method on Heterogeneous Architectures. SC'09 Proceedings of the Conference on High Performance Computing, Networking, Storage and Analysis.
[13] Hamada, T. and Nitadori, K. TFlops Astrophysical N-body Simulation on a Cluster of GPUs. SC'10 Proceedings of the Conference on High Performance Computing, Networking, Storage and Analysis.

[14] Rahimian, A., Lashuk, I., Veerapaneni, K., Chandramowlishwaran, A., Malhotra, D., Moon, L., Sampath, R., Shringarpure, A., Vetter, J., Vuduc, R., Zorin, D., and Biros, G. Petascale Direct Numerical Simulation of Blood Flow on 200K Cores and Heterogeneous Architectures. SC'10 Proceedings of the Conference on High Performance Computing, Networking, Storage and Analysis.
[15] Jetley, P., Wesolowski, L., Gioachin, F., Kale, L. V., and Quinn, T. R. Scaling Hierarchical N-body Simulations on GPU Clusters. SC'10 Proceedings of the Conference on High Performance Computing, Networking, Storage and Analysis.
[16] Dehnen, W. A Hierarchical O(N) Force Calculation Algorithm. J. Comput. Phys. 179.
[17] Cheng, H., Greengard, L., and Rokhlin, V. A Fast Adaptive Multipole Algorithm in Three Dimensions. J. Comput. Phys. 155.
[18] Fong, W. and Darve, E. The Black-box Fast Multipole Method. J. Comput. Phys. 228.
[19] Makino, J. Yet Another Fast Multipole Method Without Multipoles: Pseudoparticle Multipole Method. J. Comput. Phys. 151.
[20] Ying, L., Biros, G., and Zorin, D. A Kernel-Independent Adaptive Fast Multipole Algorithm in Two and Three Dimensions. J. Comput. Phys. 196.
[21] Yokota, R. and Barba, L. A. Treecode and Fast Multipole Method for N-body Simulation with CUDA. GPU Computing Gems Emerald Edition, Chapter 9. Morgan Kaufmann.
[22] Yokota, R. and Barba, L. A. Hierarchical N-body Simulations with Auto-tuning for Heterogeneous Systems. Comput. Sci. Eng., in press.
[23] Warren, M. S. and Salmon, J. K. A Portable Parallel Particle Program. Comput. Phys. Comm. 87.
[24] Yokota, R., Bardhan, J. P., Knepley, M. G., Barba, L. A., and Hamada, T. Biomolecular Electrostatics Using a Fast Multipole BEM on up to 512 GPUs and a Billion Unknowns. Comput. Phys. Comm. 182.
[25] Yokota, R., Barba, L. A., Narumi, T., and Yasuoka, K. Petascale Turbulence Simulation Using a Highly Parallel Fast Multipole Method. arXiv preprint.
