Scaling Fast Multipole Methods up to 4000 GPUs

Rio Yokota, King Abdullah University of Science and Technology, 4700 KAUST, Thuwal, Saudi Arabia
Lorena Barba, Boston University, 110 Cummington St., Boston, MA 02215, USA
Tetsu Narumi, University of Electro-Communications, Chofugaoka, Chofu, Tokyo, Japan
Kenji Yasuoka, Keio University, Hiyoshi, Yokohama, Japan

ABSTRACT
The Fast Multipole Method (FMM) is a hierarchical N-body algorithm with linear complexity, high arithmetic intensity, high data locality, hierarchical communication patterns, and no global synchronization. The combination of these features allows the FMM to scale well on large GPU-based systems and to use their compute capability effectively. We present a 1 PFlop/s calculation of isotropic turbulence with 64 billion vortex particles using 4096 GPUs on the TSUBAME 2.0 system.

Categories and Subject Descriptors
D.1.3 [Concurrent Programming]: Distributed programming (global tree partitioning, hierarchical communication); Parallel programming (MPI, OpenMP, CUDA)
G.1.2 [Approximation]: Special function approximations (fast N-body approximation, fast multipole methods)

General Terms
Algorithms, Performance, Verification

Keywords
Fast Multipole Methods; GPUs; Scalability

1. INTRODUCTION
N-body algorithms are a natural way to simulate particle-based physics, as can be seen in astrophysics [1] and molecular dynamics [2]. Another common application of N-body methods is the solution of boundary integral problems that arise in acoustics [3] and electromagnetics [4]. The governing equations of these problems can be broadly categorized as elliptic PDEs, which are non-local in nature and require information to be propagated throughout the entire domain at every time step. Calculating the effect of all particles on all particles is one way to propagate the information globally. There are other ways to propagate information globally, such as solving linear systems or dealing with the PDE in Fourier space. It could be interesting to compare N-body methods with these other approaches, and to study their relative performance on next-generation hardware.

Hierarchical N-body algorithms such as Fast Multipole Methods (FMM) have linear complexity, but retain the arithmetic intensity of brute-force N-body methods within their inner kernels. This arithmetic intensity allows FMMs to extract the full potential of the GPU's compute capability. Gumerov and Duraiswami [5] were the first to implement the FMM on GPUs. To compensate for the lack of support for complex arithmetic on GPUs, they reformulated their basis functions using real spherical harmonics. They have recently updated their code to handle multiple GPUs and to utilize both the CPU and the GPU [6]. The M2M, M2L, and L2L operations were offloaded to the CPU, while the P2M, L2P, and P2P operations were done on the GPU. This decision was based on their observation that the M2M, M2L, and L2L kernels could be calculated faster on the CPU. Contrary to these observations, Takahashi et al. [7] developed a technique to maximize the performance of M2L operations on the GPU, and achieved up to 270 GFlop/s on a Tesla C1060 for the M2L kernel.

Treecodes have also been implemented efficiently on GPUs. Stock and Gharakhani [8] calculated the Biot-Savart kernel using a treecode on GPUs. Gaburov et al. [9] implemented a Laplace-kernel treecode on GPUs, and achieved 100 GFlop/s performance and 50 GB/s transfer rates. The same authors have recently updated their code to run entirely on GPUs [1].
Burtscher and Pingali [10] present yet another implementation of the treecode entirely on GPUs, where they used locks and atomic operations to reduce synchronization during the tree construction.

Multi-GPU implementations of treecodes and FMMs have been the focus of research for the past few years. Yokota et al. calculated the Biot-Savart kernel on 64 GPUs, and achieved a cost performance of $9.4/GFlop/s [10]. This was combined with the work of Hamada and Nitadori, which calculated 1.6 billion particles in 17 seconds, achieving 42 TFlop/s sustained performance on 256 GPUs and $8.0/GFlop/s [11].

Around the same time, Lashuk et al. also ran a 256-GPU calculation with the FMM, and were able to calculate 256 million points in 2.2 seconds, resulting in 8 TFlop/s of sustained performance [12]. The following year, the former work was extended to achieve 190 TFlop/s on 576 GPUs [13], while the latter work was extended to achieve 0.7 PFlop/s on 200,000 AMD cores [14]. Jetley et al. [15] implemented a treecode on multiple GPUs using the Charm++ framework, and sustained an average of 3.82 TFlop/s on 896 CPU cores + 256 GPUs.

In this work, we present a further milestone for the multi-GPU FMM, which achieves over 1 PFlop/s on 4096 GPUs of the TSUBAME 2.0 system. We present novel techniques such as the dual tree traversal, hybridization of treecode and FMM, and auto-tuning on heterogeneous systems.

2. FAST MULTIPOLE METHOD
2.1 Series Expansions
An important factor in the optimization of FMMs on heterogeneous systems is the type of expansion that is used for the multipole and local expansions. Most treecodes use simple Cartesian Taylor expansions [16], while the selection of expansions in FMMs varies from spherical harmonics [17] and plane waves [17] to Chebyshev polynomials [18] and equivalent charges on a sphere [19] or cube [20]. The asymptotic behavior with respect to the order of truncation p differs among these types of expansions, as shown in Table 1. Note that the asymptotic behavior means very little without consideration of the asymptotic constant. Methods with better asymptotics tend to have larger constants, and therefore can be slower for low-accuracy calculations. For example, gravitational N-body simulations are typically performed with p <= 3, and for this case Cartesian expansions are the optimal choice. The common misperception that treecodes are faster than FMMs for low-accuracy calculations is actually the result of the difference in the type of expansions. There is also some difference in the way treecodes form more efficient interaction lists by using the multipole acceptance criterion, but this makes a big difference only when p is small and the number of particles per leaf cell becomes small.

Table 1. Asymptotic behavior of different types of expansions

  Expansion                         Work           Storage
  Cartesian [16]                    O(p^6)         O(p^3)
  Spherical [17]                    O(p^4)         O(p^2)
  Spherical+rotation [17]           O(p^3)         O(p^2)
  Plane-wave [17]                   O(p^3)         O(p^2)
  Chebyshev [18]                    O(p^6)         O(p^3)
  Equivalent charge (sphere) [19]   O(p^4)         O(p^2)
  Equivalent charge (cube) [20]     O(p^3 log p)   O(p^2)
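To make the storage column of Table 1 concrete, the short program below (an illustration, not code from this work) counts the expansion coefficients stored per cell for a Cartesian Taylor expansion and for a spherical harmonics expansion at the same truncation order p; the counts (p+1)(p+2)(p+3)/6 and (p+1)^2 are the standard term counts for these two expansions. It shows why the better asymptotic behavior only pays off once p is moderately large, which is the point made above about asymptotic constants.

    #include <cstdio>

    // Number of Cartesian Taylor coefficients up to total order p in 3-D:
    // all monomials x^i y^j z^k with i + j + k <= p, i.e. (p+1)(p+2)(p+3)/6.
    int cartesianTerms(int p) { return (p + 1) * (p + 2) * (p + 3) / 6; }

    // Number of spherical harmonics coefficients up to degree p: (p+1)^2.
    int sphericalTerms(int p) { return (p + 1) * (p + 1); }

    int main() {
      std::printf(" p  Cartesian  Spherical\n");
      for (int p = 1; p <= 15; p++)
        std::printf("%2d  %9d  %9d\n", p, cartesianTerms(p), sphericalTerms(p));
      // p = 3 : 20 vs. 16 coefficients, nearly identical, so the simpler
      //         Cartesian kernels win for low-accuracy gravity calculations.
      // p = 10: 286 vs. 121, where the O(p^2) storage of spherical harmonics wins.
      return 0;
    }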
The strategy for implementing these different types of expansions on GPUs is quite different. For some of these expansions, precomputation of the translation matrices is effective, whereas in other cases it is faster to recompute everything inside the GPU kernel and reduce the amount of data being transferred to the device. So far we have experimented with GPU implementations of Cartesian, spherical [21], spherical+rotation, and equivalent charge (sphere) [10] expansions. For the Cartesian and spherical expansions, computing all translation matrix entries on-the-fly gave the best performance, whereas the Wigner rotation matrices for the spherical harmonics with rotation and the inverse matrices for the equivalent charges were faster if they were pre-calculated on the GPU and stored in global memory. Note that for a cubic octree, the non-adaptive M2L translation stencil has 7^3 - 3^3 = 316 possible relative positions of the cells. Therefore, all translation matrices, Wigner rotation matrices, and inverse matrices of the equivalent charges can be pre-computed and stored on the GPU.

2.2 Auto-tuning
The majority of the calculation time in FMMs is spent on the M2L (multipole-to-local) and P2P (particle-to-particle) kernels shown in Figure 1.

Figure 1. Flow of the FMM calculation (P2M, M2M, M2L, L2L, L2P, P2P); information moves from the source particles to the target particles.

The P2P kernel is the same as a brute-force N-body calculation, except that each target particle interacts only with neighboring source particles. The P2P kernel does not depend on the type of expansion or the order of truncation. However, it is indirectly influenced by the change in the optimum number of particles per leaf cell, which does depend on the type of expansion and the order of truncation. The M2L kernel is directly influenced by both the type of expansion and the order of truncation. The M2L kernel usually has lower arithmetic intensity than the P2P kernel, so it does not accelerate as much on the GPU. This in turn shifts the balance between the calculation times of these two kernels, which is dealt with by changing the number of particles per leaf cell, or by calculating the M2L kernel on the CPU [6]. Determining what the number of particles per cell should be for a given architecture, or deciding which device to execute the M2L kernel on, is a nontrivial matter. We have developed an auto-tuning mechanism to automatically determine the optimum number of particles per cell for any given architecture [22]. In our approach, we time the M2L and P2P kernels on the device that we intend to run on, and then use these timing results to select whether to calculate the M2L or the P2P kernel for a given pair of cells. As long as the tree structure is deep enough, this method will switch to P2P kernels at the optimum level and terminate the tree traversal.
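The calibration step can be illustrated with the following sketch, which is not the code used in this work: it times two stand-in workloads (placeholders for launching the actual P2P and M2L GPU kernels on representative cells) and reduces the measurements to a cost per particle pair and per cell pair. These two numbers are what the traversal later compares when deciding how to treat a given pair of cells.

    #include <chrono>
    #include <cstdio>

    // Measure the average runtime of a kernel by calling it `reps` times.
    // `kernel` is any callable; in the real code it would launch the P2P or
    // M2L GPU kernel on a representative pair of cells.
    template <class F>
    double timeKernel(F kernel, int reps = 100) {
      using clock = std::chrono::steady_clock;
      auto t0 = clock::now();
      for (int i = 0; i < reps; i++) kernel();
      auto t1 = clock::now();
      return std::chrono::duration<double>(t1 - t0).count() / reps;
    }

    int main() {
      const int particlesPerLeaf = 500;  // candidate leaf size to calibrate

      // Stand-in workloads (placeholders, not the actual FMM kernels).
      volatile double sink = 0;
      auto fakeP2P = [&] {
        for (int i = 0; i < particlesPerLeaf * particlesPerLeaf; i++) sink += 1.0 / (i + 1);
      };
      auto fakeM2L = [&] {
        for (int i = 0; i < 10000; i++) sink += 1.0 / (i + 1);
      };

      double p2pPerPair = timeKernel(fakeP2P) / (double(particlesPerLeaf) * particlesPerLeaf);
      double m2lPerCellPair = timeKernel(fakeM2L);

      std::printf("P2P cost per particle pair : %.3e s\n", p2pPerPair);
      std::printf("M2L cost per cell pair     : %.3e s\n", m2lPerCellPair);
      // These two numbers are later compared during the tree traversal to decide,
      // for each pair of cells, whether P2P or M2L is the cheaper option.
      return 0;
    }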

This concept may seem quite natural for those in the treecode community, but may be difficult to grasp for those using the FMM. This is because it is uncommon in FMMs to actually traverse the tree structure, and without the concept of tree traversal this auto-tuning mechanism will not function. A conventional FMM loops over all target cells and tries to explicitly form a well-separated list of source cells. Excluding the well-separated source cells at the parent level relies on a neighbor search via Morton indexing, which makes it difficult to exclude the cells that were handled by the P2P kernel. Therefore, if we tried to implement our auto-tuning scheme in the standard FMM framework, we would have to devise a method to keep track of which cells were handled by P2P for every single target cell in the tree, and exclude them from the M2L well-separated list.

The dual tree traversal method introduced by Warren and Salmon [23] and refined by Dehnen [16] uses the concept of tree traversal but achieves linear complexity, which makes it an FMM. The general idea is to traverse two trees simultaneously, one for the target and one for the source. Starting from a pair of root cells, the well-separatedness is examined per pair, and the larger (or equal) cell is subdivided until the pair is either well-separated enough to perform M2L or both cells become leaves, at which point the P2P kernel is calculated. This is a very generic and flexible approach, which eliminates the need to calculate explicit well-separated lists of cells in the FMM. It also turns out to be a convenient framework for implementing our auto-tuning mechanism: we can simply change the condition to calculate the P2P kernel from "if both cells are leaves" to "if it is faster to do so" (a sketch is given at the end of this subsection).

The effect of our auto-tuning mechanism on a single GPU is shown in Figure 2. Each data point is for a different number of particles between 100 thousand and 10 million.

Figure 2. FMM runtime for different problem sizes N (manual and auto-tuned runs with 500 and 100 particles per leaf cell).

In the legend, "manual" refers to the FMM without auto-tuning and "auto" to the one with auto-tuning, while "ppl" stands for particles per leaf cell. The optimum value for this case is 500 ppl. If we artificially make the tree structure deeper than it should be, the M2L kernel becomes disproportionately large for most N. This can be observed in the results for the manual case with 100 ppl. However, when the auto-tuning capability is introduced, the tree traversal is terminated at the optimum level and the results match those of the 500 ppl case. This shows that our auto-tuning mechanism will find the correct type of kernel to use for any given architecture.
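The following is a minimal serial sketch of the dual tree traversal with the auto-tuned termination condition (the sketch referred to above). It is not the traversal used in this work: the cell layout, the multipole acceptance test, and the kernel stubs are simplified placeholders. It only illustrates how the calibrated per-interaction costs replace the usual "calculate P2P only when both cells are leaves" rule.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Cell {
      std::size_t numBodies = 0;
      double radius = 0;               // half edge length of the cell
      double x[3] = {0, 0, 0};         // center of the cell
      std::vector<Cell*> children;     // empty for a leaf
      bool isLeaf() const { return children.empty(); }
    };

    // Per-interaction costs measured by the calibration step (seconds).
    struct KernelCosts { double p2pPerPair, m2lPerCellPair; };

    // Stubs standing in for the actual GPU kernel launches.
    void p2p(Cell&, Cell&) { /* launch P2P kernel for this cell pair */ }
    void m2l(Cell&, Cell&) { /* launch M2L translation for this cell pair */ }

    // Multipole acceptance criterion: the pair is well separated when the
    // cell sizes are small compared with the distance between their centers.
    bool wellSeparated(const Cell &a, const Cell &b, double theta = 0.5) {
      double dx = a.x[0] - b.x[0], dy = a.x[1] - b.x[1], dz = a.x[2] - b.x[2];
      double r = std::sqrt(dx * dx + dy * dy + dz * dz);
      return (a.radius + b.radius) < theta * r;
    }

    void traverse(Cell &target, Cell &source, const KernelCosts &cost) {
      if (wellSeparated(target, source)) {
        // Auto-tuning: use the calibrated timings to pick the cheaper kernel,
        // instead of always applying M2L to well-separated pairs.
        double p2pCost = cost.p2pPerPair * target.numBodies * source.numBodies;
        if (p2pCost < cost.m2lPerCellPair) p2p(target, source);
        else m2l(target, source);
      } else if (target.isLeaf() && source.isLeaf()) {
        p2p(target, source);           // close pair of leaves: direct summation
      } else if (source.isLeaf() ||
                 (!target.isLeaf() && target.radius >= source.radius)) {
        for (Cell *child : target.children) traverse(*child, source, cost);
      } else {
        for (Cell *child : source.children) traverse(target, *child, cost);
      }
    }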
3. SCALABILITY RESULTS
3.1 Bioelectrostatics on DEGIMA
The first scalability test on a large GPU system uses the FMM in conjunction with a boundary element method (BEM) to accelerate a biomolecular electrostatics calculation [24]. The applications demonstrated include the electrostatics of protein-drug binding and several multi-million-atom systems consisting of hundreds to thousands of copies of lysozyme molecules. A representative boundary element mesh on a single lysozyme molecule is shown in Figure 3.

Figure 3. Boundary element mesh on a lysozyme molecule.

The parallel scalability of the software was studied on the DEGIMA cluster at the Nagasaki Advanced Computing Center. At the time of the runs, the DEGIMA system had 144 nodes and 288 NVIDIA GTX 295 cards, each with two GPUs, resulting in a total of 576 GPUs. There are 6 QDR InfiniBand switches connected with 4 QDR networks. The switches are connected to the nodes by SDR InfiniBand, and the total bisection bandwidth is 160 Gbps.

We performed a strong scalability study of the FMM on the DEGIMA system using up to 512 GPUs. The global problem size was N = 10^8 and the order of multipole expansions was set to p = 10. The type of expansion used was spherical harmonics with rotation-based translations. Figure 4 shows the breakdown of the FMM calculation time multiplied by the number of MPI processes; a constant value therefore means perfect strong scaling. The same scalability tests were run on TSUBAME 2.0, whose hardware specifications are described in the following subsection.

Figure 4. Strong scaling of the FMM for N = 10^8 on DEGIMA and TSUBAME 2.0 (time multiplied by the number of processes, broken down into tree construction, mpisendp2p, mpisendm2l, and the P2P, P2M, M2M, M2L, L2L, and L2P kernels).

In the legend, "tree construction" denotes the time spent sorting the Morton indices and preprocessing the FMM kernels, while "mpisendp2p" and "mpisendm2l" are the MPI communication times for sending the particles and multipoles, respectively. The other legend entries are self-explanatory. The kernel runtimes include the buffering and transfer of data to the GPU, though we have confirmed that 70% of the time was spent on the actual CUDA kernel [24]. As expected, the M2L and P2P kernels take up a large portion of the entire runtime. The scalability on DEGIMA is perfect up to 128 GPUs, but decays rapidly after that: at 256 GPUs the parallel efficiency is 78%, and at 512 GPUs it decreases to 48%. On the other hand, the results on TSUBAME 2.0 show a more gradual decrease in parallel efficiency: at 256 GPUs the parallel efficiency is 79% (almost the same as on DEGIMA), and at 512 GPUs it is 65%. The difference in the scalability of the two systems is more likely a latency issue than a bandwidth limitation, since this is a strong scaling test and, with the acceleration of GPUs, the calculation of 100 million points takes less than half a second.

We used a flat MPI parallelization with 4 processes per node on DEGIMA and 3 processes per node on TSUBAME 2.0, matching the number of GPUs on each node. On DEGIMA, the MPI communicator was split into an inter-node communicator and an intra-node communicator, and the alltoallv communication was performed in two stages. On TSUBAME 2.0, the native MPI_Alltoallv was faster than the two-stage version.
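A minimal sketch of the communicator splitting behind the two-stage alltoallv on DEGIMA is shown below, assuming ranks are numbered consecutively within each node and using a hypothetical RANKS_PER_NODE constant; it is not the implementation used in this work. Stage one exchanges data within a node so that each local rank holds everything destined for remote ranks with the same local index, and stage two exchanges across nodes within that "column".

    #include <mpi.h>

    // Assumed layout (not from the paper): MPI ranks are numbered
    // consecutively on each node, RANKS_PER_NODE ranks per node
    // (4 on DEGIMA, 3 on TSUBAME 2.0 in the runs described above).
    const int RANKS_PER_NODE = 4;

    // Build an intra-node communicator (all ranks on the same node) and an
    // inter-node communicator (all ranks with the same local index, one per
    // node). A two-stage alltoallv first routes each message, inside the node,
    // to the local rank whose index matches the destination's local index,
    // then delivers it across nodes inside that column.
    void splitCommunicators(MPI_Comm &intraNode, MPI_Comm &interNode) {
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      int node  = rank / RANKS_PER_NODE;  // which node this rank lives on
      int local = rank % RANKS_PER_NODE;  // index of this rank within its node
      MPI_Comm_split(MPI_COMM_WORLD, node,  local, &intraNode);
      MPI_Comm_split(MPI_COMM_WORLD, local, node,  &interNode);
    }

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      MPI_Comm intraNode, interNode;
      splitCommunicators(intraNode, interNode);
      // Stage 1: MPI_Alltoallv on intraNode (send buffers repacked by the
      //          destination's local index beforehand).
      // Stage 2: MPI_Alltoallv on interNode (buffers repacked by destination node).
      MPI_Comm_free(&intraNode);
      MPI_Comm_free(&interNode);
      MPI_Finalize();
      return 0;
    }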

The actual bioelectrostatics BEM run consisted of multiple lysozyme molecules, with each molecular surface discretized into 102,486 boundary element nodes. This results in a calculation with over 20 million atoms and over one billion unknowns. The FMM for this configuration took approximately one minute per BEM iteration on 512 GPUs, which yields a sustained performance of 34.6 TFlop/s.

3.2 Turbulence on TSUBAME 2.0
The second scalability test uses the FMM to calculate the interaction of vortex elements in a particle-based turbulence simulation [25]. The tests were run on the full TSUBAME 2.0 system at the Tokyo Institute of Technology. The TSUBAME 2.0 system has 1408 nodes, each with a 12-core Westmere-EP 2.93 GHz CPU configuration, 3 NVIDIA M2050 GPUs, 54 GB of RAM, and 120 GB of local SSD storage. The interconnect is a dual-rail QDR InfiniBand with 2 x 40 Gbps of bandwidth, and the bisection bandwidth of the entire system is over 200 Tbps. Most of TSUBAME 2.0's 2.4 PFlop/s peak performance comes from its 4224 GPUs: 512 GFlop/s per GPU, or 2.2 PFlop/s in total.

The test case is a decaying isotropic turbulence with an initial microscale Reynolds number of Re_λ = 500. The domain is [-π, π]^3 and is resolved by roughly 69 billion vortex particles. The FMM was extended to handle periodic boundary conditions by setting up 27 periodic images in each direction. The order of truncation in the FMM was set to p = 14 to capture the high frequencies of the kinetic energy spectrum.

Figure 5. Isosurface of the second invariant II of the velocity gradient tensor in isotropic turbulence.

The isosurface of the second invariant of the velocity gradient tensor is shown in Figure 5. Due to the limited computing time available on the full TSUBAME 2.0 machine, we were not able to simulate the isotropic turbulence to the point where coherent vortex structures could be observed. Nonetheless, the high fidelity of the vortex simulation can be seen in Figure 5.

The turbulence run was calculated on 4096 GPUs. We also performed a weak scaling test by scaling down the problem size along with the number of GPUs, so that the number of particles per process remained constant. The calculation time of one time step of the vortex particle simulation is shown against the number of processes in Figure 6; a constant value means perfect weak scaling.

Figure 6. Weak scaling of the FMM on TSUBAME 2.0 (time per step, broken down into P2P evaluation, FMM evaluation, MPI communication, GPU buffering, and tree construction).

In the legend, "P2P evaluation" denotes the time spent on the P2P GPU kernel, this time excluding the buffering and transfer of data. Similarly, "FMM evaluation" is the total GPU kernel time spent on all FMM kernels, excluding the buffering and transfer of data. "MPI communication" represents the total time spent on MPI communication of both particles and multipoles. Note, however, that the communication of particles is overlapped with the P2P evaluation of local particles. The overlapped time is subtracted from "P2P evaluation" so that the total height of the bar shows the actual wall-clock time. As a side effect, the P2P evaluation may seem to shrink for large numbers of processes, but it is actually the increasing overlap with communication time that causes this. "GPU buffering" is the time spent buffering and transferring data to the GPU. Unlike the strong scaling test in the previous subsection, the weak scaling tests require more storage than is available in GPU device memory, so multiple calls to the GPU have to be made; this increases the GPU buffering time. We hope to alleviate this problem by using double buffering and asynchronous memory transfers in the future. Finally, "tree construction" includes the binning of particles into cells and the linking of the tree structure, and most importantly the partitioning of the global tree structure. Although our implementation only updates the global tree structure and never reconstructs it, the migration of particles currently results in a significant amount of communication. We are investigating a more efficient way to handle the update of the global tree structure.

The parallel efficiency on 4096 GPUs was 74%, and the sustained performance of the turbulence calculation was 1.01 PFlop/s.
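The overlap of particle communication with the local P2P evaluation mentioned above can be sketched as follows. This is not the routine used in this work: the buffer layout is assumed to be already packed per neighbor rank, the receive buffers pre-sized, and localP2P()/remoteP2P() are hypothetical placeholders for the GPU kernel launches.

    #include <mpi.h>
    #include <cstddef>
    #include <vector>

    struct Particle { double x, y, z, q; };

    // Placeholders for the GPU kernel launches used in the actual code.
    void localP2P(std::vector<Particle> &local) { (void)local; }
    void remoteP2P(std::vector<Particle> &local,
                   const std::vector<Particle> &halo) { (void)local; (void)halo; }

    // Post non-blocking sends/receives of halo particles, evaluate the local
    // P2P interactions while the messages are in flight, then wait and add
    // the contributions of the received remote particles.
    void overlappedP2P(std::vector<Particle> &local,
                       const std::vector<int> &neighbors,
                       std::vector<std::vector<Particle>> &sendBufs,
                       std::vector<std::vector<Particle>> &recvBufs,  // pre-sized
                       MPI_Comm comm) {
      std::vector<MPI_Request> reqs(2 * neighbors.size());
      for (std::size_t i = 0; i < neighbors.size(); i++) {
        // Particle is plain data, so it is exchanged here as raw bytes.
        MPI_Irecv(recvBufs[i].data(), int(recvBufs[i].size() * sizeof(Particle)),
                  MPI_BYTE, neighbors[i], 0, comm, &reqs[2 * i]);
        MPI_Isend(sendBufs[i].data(), int(sendBufs[i].size() * sizeof(Particle)),
                  MPI_BYTE, neighbors[i], 0, comm, &reqs[2 * i + 1]);
      }
      localP2P(local);  // overlapped with the particle communication
      MPI_Waitall(int(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);
      for (const auto &halo : recvBufs) remoteP2P(local, halo);
    }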

4. ACKNOWLEDGMENTS
Computing time on the TSUBAME 2.0 system was made possible by the Grand Challenge Program of TSUBAME. LAB acknowledges partial support from NSF grant OCI, ONR award #N, and the Boston University College of Engineering.

5. REFERENCES
[1] Bedorf, J., Gaburov, E., and Portegies Zwart, S. A Sparse Octree Gravitational N-body Code that Runs Entirely on the GPU Processor. J. Comput. Phys. 231.
[2] Levine, B. G., Stone, J. E., and Kohlmeyer, A. Fast Analysis of Molecular Dynamics Trajectories with Graphics Processing Units - Radial Distribution Function Histogramming. J. Comput. Phys. 230.
[3] Wu, H., Liu, Y., and Jiang, W. Analytical Integration of the Moments in the Diagonal Form Fast Multipole Boundary Element Method for 3-D Acoustic Wave Problems. Eng. Anal. Bound. Elem. 36.
[4] Tsuji, P. and Ying, L. A Fast Directional Algorithm for High-frequency Electromagnetic Scattering. J. Comput. Phys. 230.
[5] Gumerov, N. A. and Duraiswami, R. Fast Multipole Methods on Graphics Processors. J. Comput. Phys. 227.
[6] Hu, Q., Gumerov, N. A., and Duraiswami, R. Scalable Fast Multipole Methods on Distributed Heterogeneous Architectures. SC'11 Proceedings of the Conference on High Performance Computing, Networking, Storage and Analysis.
[7] Takahashi, T., Cecka, C., Fong, W., and Darve, E. Optimizing the Multipole-to-local Operator in the Fast Multipole Method for Graphical Processing Units. Int. J. Numer. Meth. Eng. 89.
[8] Stock, M. J. and Gharakhani, A. Toward Efficient GPU-accelerated N-body Simulations. AIAA Paper.
[9] Gaburov, E., Bedorf, J., and Portegies Zwart, S. Gravitational Tree-code on Graphics Processing Units: Implementation in CUDA. Procedia Computer Science 1.
[10] Yokota, R., Narumi, T., Sakamaki, R., Kameoka, S., Obi, S., and Yasuoka, K. Fast Multipole Methods on a Cluster of GPUs for the Meshless Simulation of Turbulence. Comput. Phys. Comm. 180.
[11] Hamada, T., Yokota, R., Nitadori, K., Narumi, T., Yasuoka, K., and Taiji, M. TFlops Hierarchical N-body Simulations on GPUs with Applications in both Astrophysics and Turbulence. SC'09 Proceedings of the Conference on High Performance Computing, Networking, Storage and Analysis.
[12] Lashuk, I., Chandramowlishwaran, A., Langston, H., Nguyen, T. A., Sampath, R., Shringarpure, A., Vuduc, R., Ying, L., Zorin, D., and Biros, G. A Massively Parallel Adaptive Fast Multipole Method on Heterogeneous Architectures. SC'09 Proceedings of the Conference on High Performance Computing, Networking, Storage and Analysis.
[13] Hamada, T. and Nitadori, K. TFlops Astrophysical N-body Simulation on a Cluster of GPUs. SC'10 Proceedings of the Conference on High Performance Computing, Networking, Storage and Analysis.

[14] Rahimian, A., Lashuk, I., Veerapaneni, K., Chandramowlishwaran, A., Malhotra, D., Moon, L., Sampath, R., Shringarpure, A., Vetter, J., Vuduc, R., Zorin, D., and Biros, G. Petascale Direct Numerical Simulation of Blood Flow on 200K Cores and Heterogeneous Architectures. SC'10 Proceedings of the Conference on High Performance Computing, Networking, Storage and Analysis.
[15] Jetley, P., Wesolowski, L., Gioachin, F., Kale, L. V., and Quinn, T. R. Scaling Hierarchical N-body Simulations on GPU Clusters. SC'10 Proceedings of the Conference on High Performance Computing, Networking, Storage and Analysis.
[16] Dehnen, W. A Hierarchical O(N) Force Calculation Algorithm. J. Comput. Phys. 179.
[17] Cheng, H., Greengard, L., and Rokhlin, V. A Fast Adaptive Multipole Algorithm in Three Dimensions. J. Comput. Phys. 155.
[18] Fong, W. and Darve, E. The Black-box Fast Multipole Method. J. Comput. Phys. 228.
[19] Makino, J. Yet Another Fast Multipole Method Without Multipoles: Pseudoparticle Multipole Method. J. Comput. Phys. 151.
[20] Ying, L., Biros, G., and Zorin, D. A Kernel-Independent Adaptive Fast Multipole Algorithm in Two and Three Dimensions. J. Comput. Phys. 196.
[21] Yokota, R. and Barba, L. A. Treecode and Fast Multipole Method for N-body Simulation with CUDA. GPU Computing Gems Emerald Edition, Chapter 9. Morgan Kaufmann.
[22] Yokota, R. and Barba, L. A. Hierarchical N-body Simulations with Auto-tuning for Heterogeneous Systems. Comput. Sci. Eng., in press.
[23] Warren, M. S. and Salmon, J. K. A Portable Parallel Particle Program. Comput. Phys. Comm. 87.
[24] Yokota, R., Bardhan, J. P., Knepley, M. G., Barba, L. A., and Hamada, T. Biomolecular Electrostatics Using a Fast Multipole BEM on up to 512 GPUs and a Billion Unknowns. Comput. Phys. Comm. 182.
[25] Yokota, R., Barba, L. A., Narumi, T., and Yasuoka, K. Petascale Turbulence Simulation Using a Highly Parallel Fast Multipole Method. arXiv preprint.
