The Fast Multipole Method on NVIDIA GPUs and Multicore Processors


The Fast Multipole Method on NVIDIA GPUs and Multicore Processors
Toru Takahashi (Department of Mechanical Science and Engineering, Nagoya University), Cris Cecka (Institute for Applied Computational Science, Harvard University), Eric Darve (Institute for Computational and Mathematical Engineering, Stanford University)
May 17, 2012

Outline
1. Fast multipole method
2. Multipole-to-local operator: blocking and data access pattern
3. Particle-particle interactions
4. Numerical results: multicore CPU and NVIDIA GPU

Fast Multipole Method
Applications:
- Gravitational simulations, molecular dynamics, electrostatic forces
- Integral equations for Laplace, Poisson, Stokes, Navier; both fluid and solid mechanics
- Interpolation using radial basis functions: meshless methods, graphics (data smoothing)
- Imaging and inverse problems: subsurface characterization and monitoring, imaging
- O(N) fast linear algebra: fast matrix-vector and matrix-matrix products, matrix inverse, LU factorization, for dense and sparse matrices, FastLAPACK

Matrix-vector products
FMM: a method to calculate

    φ(x_i) = Σ_{j=1}^{N} K(x_i, x_j) σ_j,    1 ≤ i ≤ N

in O(N) or O(N ln^α N) arithmetic operations.
There are many different FMM formulations. Most of them rely on low-rank approximations of the kernel function K in the form:

    K(x, y) = Σ_{q=1}^{p} Σ_{s=1}^{p} u_q(x) T_{qs} v_s(y) + ε
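
For reference, the direct evaluation that the FMM accelerates is just the double loop below. This is a minimal host-side sketch in C++ (illustrative names such as Point and directSum are mine, not the authors'), written for the kernel K(x, y) = 1/|x − y| used in the benchmarks later in the talk; the validation slide compares the FMM codes against exactly this kind of O(N^2) reference.

    #include <vector>
    #include <cmath>
    #include <cstddef>

    // Direct O(N^2) evaluation of phi(x_i) = sum_j K(x_i, x_j) * sigma_j
    // for the Laplace kernel K(x, y) = 1 / |x - y|.
    struct Point { float x, y, z; };

    std::vector<float> directSum(const std::vector<Point>& pts,
                                 const std::vector<float>& sigma) {
        const std::size_t N = pts.size();
        std::vector<float> phi(N, 0.0f);
        for (std::size_t i = 0; i < N; ++i) {
            for (std::size_t j = 0; j < N; ++j) {
                if (i == j) continue;                   // skip self-interaction
                const float dx = pts[i].x - pts[j].x;
                const float dy = pts[i].y - pts[j].y;
                const float dz = pts[i].z - pts[j].z;
                const float r  = std::sqrt(dx*dx + dy*dy + dz*dz);
                phi[i] += sigma[j] / r;                 // K(x_i, x_j) = 1/r
            }
        }
        return phi;
    }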

FMM low-rank approximation
Many formulas are possible to obtain a low-rank approximation. We use an approach based on interpolation formulas for smooth functions. This allows building an FMM that is:
- Very general: any smooth (even oscillatory) kernel can be used.
- Simple to apply: we only need a routine that evaluates K.
- Rapidly convergent, with optimal rank for a given L2 error.
It is, however, less efficient than specialized methods for K(x, y) = 1/|x − y|.

Interpolation scheme
The low-rank approximation is obtained through a Chebyshev interpolation scheme:

    K(x, y) ≈ Σ_l Σ_m R_n(x̄_l, x) K(x̄_l, ȳ_m) R_n(ȳ_m, y)

[Figure: Chebyshev interpolation nodes x̄_0, ..., x̄_7 on the source and target intervals.]
W. Fong, E. Darve, The black-box fast multipole method, J. Comput. Phys., 228(23) (2009) 8712-8725.
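
The interpolation weights R_n are built from 1D Chebyshev sums. Below is a minimal sketch of that 1D building block, assuming the formula S_n(x, y) = 1/n + (2/n) Σ_{k=1}^{n-1} T_k(x) T_k(y) from the black-box FMM paper and points already rescaled to [−1, 1]; function names are mine.

    #include <cmath>

    // 1D Chebyshev interpolation weight:
    //   S_n(x, y) = 1/n + (2/n) * sum_{k=1}^{n-1} T_k(x) T_k(y),
    // with T_k the Chebyshev polynomial of the first kind and x, y in [-1, 1].
    // In 3D, R_n is (under this construction) the product of S_n over the
    // three coordinates.
    double chebyshevT(int k, double x) {
        return std::cos(k * std::acos(x));       // T_k(x) = cos(k * acos(x))
    }

    double S_n(int n, double x, double y) {
        double s = 1.0 / n;
        for (int k = 1; k < n; ++k)
            s += (2.0 / n) * chebyshevT(k, x) * chebyshevT(k, y);
        return s;
    }

    // Chebyshev nodes on [-1, 1]: x̄_l = cos((2l+1)π / (2n)), l = 0, ..., n-1.
    double chebyshevNode(int n, int l) {
        const double PI = std::acos(-1.0);
        return std::cos((2.0 * l + 1.0) * PI / (2.0 * n));
    }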

FMM operators
[Figure: M2M, M2L, and L2L operators acting on the Chebyshev nodes x̄_0, ..., x̄_7 of parent and child clusters.]
- The M2M and L2L operators are transposes of each other.
- Each operator depends only on the relative position of the clusters involved; it is independent of the location of the particles inside each cluster.

Data structures and well-separated clusters
The domain is recursively subdivided: each octant is split into smaller octants until the clusters are small enough that interactions between particles in neighboring clusters can be computed directly, that is, by direct evaluation of Σ_j K(r_i, r_j) σ_j.
In a typical situation, a cluster is required to interact only with clusters whose parents are neighbors (i.e., share a vertex, edge, or face) of its own parent cluster. In addition, since the low-rank approximation cannot be used with its own neighbors, each cluster only has to interact with a relatively small number of clusters: at most 189 (= 6^3 − 3^3) in 3D.
[Figure 1: Two spheres containing r_i and r_j, illustrating the restricted domain of validity of the low-rank approximation (1).]
[Figure 2: 2D illustration. (a) A square is subdivided into four quadrants; the well-separated condition (1) does not apply to any pair of quadrants. (b) Each quadrant is further subdivided into 4 quadrants; the interaction between the two gray clusters can now be computed using the low-rank approximation (1), while the interaction between the gray cluster and its neighboring clusters requires a further decomposition.]
[Figure 3: 2D illustration of the basic interaction list: the gray cluster near the center interacts with all the light gray clusters around it. In 3D there are 189 of them.]
[Figure 4: Illustration of the three passes of the FMM: upward pass (P2M, M2M), transfer, and downward pass (L2L, L2P), from tree root to leaves.]
E. Darve, C. Cecka, and T. Takahashi. The fast multipole method on parallel clusters, multicore processors, and graphics processing units. Comptes Rendus Mécanique, 339(2-3):185-193, 2011.
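
The "at most 189" count can be verified by brute force: enumerate the 6×6×6 = 216 children of the parent's 3×3×3 neighborhood and discard the cell's own 3×3×3 neighborhood. A short self-contained C++ check (index convention and names are mine):

    #include <cstdio>
    #include <cstdlib>

    // Count the interaction list of a cell at index (cx, cy, cz): children of
    // the parent's neighbors that are not neighbors of the cell itself.
    // The result is 189 = 6^3 - 3^3 for any interior cell.
    int interactionListSize(int cx, int cy, int cz) {
        int count = 0;
        const int px = cx / 2, py = cy / 2, pz = cz / 2;   // parent index
        // Children of the 3x3x3 block of parents around (px, py, pz).
        for (int x = 2 * (px - 1); x <= 2 * (px + 1) + 1; ++x)
            for (int y = 2 * (py - 1); y <= 2 * (py + 1) + 1; ++y)
                for (int z = 2 * (pz - 1); z <= 2 * (pz + 1) + 1; ++z) {
                    const bool isNeighbor = std::abs(x - cx) <= 1 &&
                                            std::abs(y - cy) <= 1 &&
                                            std::abs(z - cz) <= 1;
                    if (!isNeighbor) ++count;   // well separated: handled by M2L
                }
        return count;
    }

    int main() {
        std::printf("%d\n", interactionListSize(4, 5, 6));   // prints 189
        return 0;
    }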

M2L operator
The M2L operator K(x̄_l, ȳ_m) needs to be applied many times. It is independent of the particle positions x_i, which allows certain optimizations to be computed off-line.
A singular value decomposition is used to compress the operator K(x̄_l, ȳ_m) to an optimal rank:

    K_IJ = Ω_o^{1/2} U Σ V^T Ω_s^{1/2}

[Figure: matrix diagrams of the factorization and of its application to σ_J.]
Cost: 2rN instead of N^2.
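
A hedged sketch of this off-line compression, using Eigen's SVD rather than whatever routine the authors' code relies on, and omitting the weight matrices Ω_o and Ω_s for brevity: truncate the transfer matrix to its leading r singular triplets, then apply it as two thin products so the on-line cost drops from n^2 to about 2rn.

    #include <Eigen/Dense>

    // Off-line: compress a dense n x n transfer matrix K to rank r via SVD.
    // Sketch only: the diagonal weight matrices from the slide are omitted.
    struct CompressedOp {
        Eigen::MatrixXf Ur;   // n x r
        Eigen::MatrixXf Vr;   // n x r, columns already scaled by singular values
    };

    CompressedOp compress(const Eigen::MatrixXf& K, int r) {
        Eigen::JacobiSVD<Eigen::MatrixXf> svd(K, Eigen::ComputeThinU | Eigen::ComputeThinV);
        CompressedOp op;
        op.Ur = svd.matrixU().leftCols(r);
        op.Vr = svd.matrixV().leftCols(r) * svd.singularValues().head(r).asDiagonal();
        return op;
    }

    // On-line: y = K * sigma  ≈  Ur * (Vr^T * sigma), i.e., two (n x r) products.
    Eigen::VectorXf apply(const CompressedOp& op, const Eigen::VectorXf& sigma) {
        return op.Ur * (op.Vr.transpose() * sigma);
    }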

Convergence of Chebyshev interpolation and SVD compression
[Figure, left panel: relative L2 error vs SVD cutoff for K = 1/r (2D prolate mesh), for interpolation orders n = 3 to 10.]
[Figure, right panel: relative L2 error vs target accuracy ε_ACA for K = e^{ikr}/r, for l = 4 to 13.]
M. Messner, M. Schanz, and E. Darve. Fast directional multilevel summation for oscillatory kernels based on Chebyshev interpolation. J. Comp. Phys., 2011.

M2L operator
The M2L operator, inside a given level, leads to a set of nested loops:

    For all clusters C in this level {
      For all clusters C' in the interaction list of C {
        For q = 1 to p {
          For s = 1 to p {
            L(q)(C) += T(q,s)(C,C') * M(s)(C');
          }
        }
      }
    }

C and C' are clusters that are well separated. The matrix T(q,s) is the M2L operator. The vector M holds the multipole coefficients and L the local coefficients. The core operation is therefore a matrix-vector product.

Blocking strategies
Four different strategies:
1. Basic approach: matrix-vector products.
2. Sibling blocking: reuse T(q,s)(C,C') for all pairs (C,C') that share the same T. This allows using matrix-matrix products instead of matrix-vector products (64 MV become 27 MM); a sketch of this grouping follows below.
   [Figure: sibling source clusters S_0..S_3 and observation siblings O_0..O_3 grouped so that one transfer matrix T(q,s)(C,C') is applied to several multipole vectors at once.]
3. Further block siblings into larger chunks. This is possible if the number of clusters is sufficient.
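
A minimal sketch of the sibling-blocking idea (Eigen-based, with my own names and data layout, not the authors'): all cluster pairs that share the same transfer matrix T have their multipole vectors gathered as columns of one matrix, so many matrix-vector products collapse into a single matrix-matrix product. This reuse of T is what raises the arithmetic intensity reported two slides further on.

    #include <Eigen/Dense>
    #include <vector>

    // Apply one p x p transfer matrix T to b cluster pairs at once:
    // gather the b multipole vectors into a p x b matrix, do one GEMM,
    // then scatter-accumulate into the b local vectors.
    void applyBlockedM2L(const Eigen::MatrixXf& T,                 // p x p
                         const std::vector<Eigen::VectorXf>& M,    // b multipole vectors
                         std::vector<Eigen::VectorXf>& L) {        // b local vectors
        const int p = static_cast<int>(T.rows());
        const int b = static_cast<int>(M.size());
        Eigen::MatrixXf M_blk(p, b);
        for (int c = 0; c < b; ++c) M_blk.col(c) = M[c];           // gather
        const Eigen::MatrixXf L_blk = T * M_blk;                   // one matrix-matrix product
        for (int c = 0; c < b; ++c) L[c] += L_blk.col(c);          // scatter / accumulate
    }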

Fourth blocking strategy
4. Move the (q, s) loop outside. This allows the greatest degree of reuse of T(q,s)(C,C') across all 316 possible relative positions of cluster pairs. Downside: the calculation is slightly more irregular due to the different treatment of each sibling. This only works at level 3 and below.
[Figure: target cluster C and a source cluster C' in its interaction list.]
T. Takahashi, C. Cecka, and E. Darve. Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units. Int. J. Num. Meth. Engr., 2011.

Arithmetic intensity
Left: ratio of the number of flops performed by the M2L kernel to the number of words read from memory. The arithmetic intensity should be around 30-35 for peak efficiency; a color key on the slide marks each variant as memory bound or arithmetic bound.

    Algorithm variant   Arithmetic intensity
    Algorithm 1         1.9 - 2.0
    Algorithm 2         4.2 - 4.3
    Algorithm 3         21 - 32
    Algorithm 4         108 - 133

Right: performance of M2L in single precision.

    Platform and algorithm   Gflop/s
    CPU                      125.2
    GPU A1                   72.0
    GPU A2                   146.9
    GPU A3                   257.3
    GPU A4                   358.8

Tech. details: medium accuracy, rank = 32, N = 10^7, K = 1/r, cube with random uniform distribution, DELL PowerEdge 1950 with two quad-core Intel Xeon E5345 2.33 GHz CPUs (OpenMP), Tesla C2050.

Comparison of algorithms on CPU (under development)

    Alg.            P  n  d    N = 10^4  10^5   10^6   10^7   10^8
    CPU-6x, alg. 2  s  4  0    56(3)     64(4)  65(5)  65(6)  65(7)
    CPU-6x, alg. 4  s  4  0    5(3)      12(4)  22(5)  25(6)  27(7)
    CPU-6x, alg. 2  s  8  -1   24(2)     28(3)  31(4)  31(5)  32(6)
    CPU-6x, alg. 4  s  8  -1   1(2)      5(3)   13(4)  22(5)  27(6)

Gflop/s in the M2L computation (tree depth in parentheses); algorithms 2 and 4 on a 6-core Xeon X5680 3.33 GHz CPU.
P: precision [s | d]; n: parameter that determines the accuracy [4 | 8]; d: parameter to adjust the depth of the tree; N: number of particles.

Short-range interactions
- Assign one thread block to one leaf box T; I is the number of threads per thread block.
- The I threads process I observation points of T in each iteration.
- For every source box S in N(T), all the threads load data into shared memory (x_j and σ_j) for J sources in S.
- The loop over sources (summation over j) is unrolled to improve performance.
- With this approach a high arithmetic intensity can be reached.
[Figure: leaf box T with a group of I observation points x_i; each thread copies data to shared memory and computes the (i, j) interactions with the J sources of charge σ_j at x_j in a source box S.]
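
These bullet points map naturally onto a CUDA kernel. The sketch below is a simplified reconstruction rather than the authors' implementation: the float4 layout, the 27-entry neighbor list, the handling of padded threads, and the unroll factor are all assumptions.

    #include <cuda_runtime.h>

    // P2P (direct, short-range) kernel sketch: one thread block per leaf box T,
    // one observation point per thread, sources of each neighbor box S staged
    // through shared memory in tiles of blockDim.x.
    __global__ void p2pKernel(const float4* __restrict__ src,       // (x, y, z, sigma) of sources
                              const int*    __restrict__ srcOffset, // per-box start index
                              const int*    __restrict__ srcCount,  // per-box source count
                              const int*    __restrict__ nbrList,   // neighbor box ids, 27 per leaf
                              const float4* __restrict__ obs,       // observation points
                              const int*    __restrict__ obsOffset,
                              const int*    __restrict__ obsCount,
                              float*        __restrict__ phi) {
        extern __shared__ float4 tile[];                 // blockDim.x staged sources

        const int leaf = blockIdx.x;
        const int iLoc = threadIdx.x;
        const int iGlb = obsOffset[leaf] + iLoc;
        const bool active = iLoc < obsCount[leaf];
        const float4 xi = active ? obs[iGlb] : make_float4(0.f, 0.f, 0.f, 0.f);
        float acc = 0.f;

        for (int n = 0; n < 27; ++n) {                   // loop over S in N(T)
            const int box = nbrList[27 * leaf + n];
            if (box < 0) continue;                       // missing neighbor at the boundary
            const int start = srcOffset[box], count = srcCount[box];
            for (int j0 = 0; j0 < count; j0 += blockDim.x) {
                const int j = j0 + threadIdx.x;
                tile[threadIdx.x] = (j < count) ? src[start + j]
                                                : make_float4(0.f, 0.f, 0.f, 0.f);
                __syncthreads();
                const int jMax = min((int)blockDim.x, count - j0);
                #pragma unroll 4                         // unroll the source loop
                for (int jj = 0; jj < jMax; ++jj) {
                    const float dx = xi.x - tile[jj].x;
                    const float dy = xi.y - tile[jj].y;
                    const float dz = xi.z - tile[jj].z;
                    const float r2 = dx*dx + dy*dy + dz*dz;
                    if (r2 > 0.f) acc += tile[jj].w * rsqrtf(r2);   // sigma_j / r
                }
                __syncthreads();
            }
        }
        if (active) phi[iGlb] += acc;
    }

    // Host-side launch (sketch): one block per leaf, I threads, and
    // I * sizeof(float4) bytes of dynamic shared memory:
    //   p2pKernel<<<numLeaves, I, I * sizeof(float4)>>>(...);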

Validation of both codes

    Code  P  n  d    N = 10^4   10^5      10^6
    CPU   s  4  0    7.39e-4    6.65e-4   7.89e-4
    GPU   s  4  0    7.39e-4    6.65e-4   7.89e-4
    CPU   s  8  -1   3.74e-6    8.05e-6   1.92e-5
    GPU   s  8  -1   3.72e-6    8.07e-6   1.92e-5
    CPU   d  4  0    7.32e-4    6.67e-4   7.80e-4
    GPU   d  4  0    7.32e-4    6.67e-4   7.80e-4
    CPU   d  8  -1   2.13e-7    2.25e-7   2.84e-7
    GPU   d  8  -1   2.13e-7    2.25e-7   2.84e-7

Relative l2 error of the CPU and GPU codes versus the direct N^2 code. K(x, y) = 1/|x − y|. Points are randomly uniformly distributed in a box.

Hardware used for tests

                                          Xeon X5680                  Tesla C2050
                                          (six-core, 3.33 GHz,
                                           two sockets)
    # of cores in total                   12                          448
    Peak performance [Gflop/s] (s / d)    319.68 / 159.84             1030 / 515
    Bandwidth [GB/s]                      64                          115
    Flop-to-word ratio                    20                          36

Hyper-threading is not used. Peak performance is for MAD operations. CPU bandwidth is 32 GB/s per socket. ECC is enabled.
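
The flop-to-word ratios in the last row follow directly from the rows above (single precision, 4-byte words):

    Xeon X5680:  319.68 Gflop/s ÷ (64 GB/s ÷ 4 B/word)  = 319.68 / 16   ≈ 20 flops per word
    Tesla C2050: 1030 Gflop/s  ÷ (115 GB/s ÷ 4 B/word)  = 1030 / 28.75  ≈ 36 flops per word

These ratios are the thresholds behind the earlier remark that the M2L arithmetic intensity should be roughly 30-35 before the GPU kernel stops being memory bound.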

Benchmark results: serial CPU breakdown
[Figure: four stacked-bar charts of time [%] vs N for the serial CPU code, in single and double precision, for (n=4, d=0) and (n=8, d=-1); N ranges from 10^4 to 10^8 in single precision and to 10^7 in double precision. The bars break the total time into setup, P2M, M2M, M2L, L2L, L2P, and direct contributions.]
P = particle, M = multipole expansion, L = local expansion.
The direct part (P2P for leaf clusters) and M2L occupy 60-98% of the total running time.

Benchmark results: CPU speedup using 12 cores (OpenMP), single precision
[Figure: two panels of parallel speed-up (0-12) vs N (10^4 to 10^8) for the single-precision CPU code, (n=4, d=0) and (n=8, d=-1); curves for the total time and for the setup, P2M, M2M, M2L, L2L, L2P, and direct phases.]

Benchmark results: CPU speedup using 12 cores (OpenMP), double precision
[Figure: two panels of parallel speed-up (0-12) vs N (10^4 to 10^7) for the double-precision CPU code, (n=4, d=0) and (n=8, d=-1); curves for the total time and for the setup, P2M, M2M, M2L, L2L, L2P, and direct phases.]

Speed-up of GPU vs CPU-12x
[Figure: four panels of GPU speed-up over the 12-core CPU code vs N, for single precision (n=4, d=0 and n=8, d=-1, N up to 10^8) and double precision (n=4, d=0 and n=8, d=-1, N up to 10^7); each panel shows curves for the total time and for the direct part.]

Gflop/s performance: M2L computation

    Code      P  n  d    N = 10^4  10^5    10^6    10^7    10^8
    CPU-12x   s  4  0    95(3)     120(4)  123(5)  126(6)  128(7)
    GPU       s  4  0    50(3)     173(4)  302(5)  359(6)  379(7)
    CPU-12x   s  8  -1   35(2)     39(3)   41(4)   43(5)   46(6)
    GPU       s  8  -1   63(2)     65(3)   192(4)  319(5)  375(6)
    CPU-12x   d  4  0    58(3)     73(4)   75(5)   75(6)   NA
    GPU       d  4  0    26(3)     83(4)   140(5)  165(6)  NA
    CPU-12x   d  8  -1   17(2)     18(3)   19(4)   20(5)   NA
    GPU       d  8  -1   41(2)     30(3)   88(4)   146(5)  NA

Gflop/s in the M2L computation (tree depth in parentheses).

Gflop/s performance: P2P computation

    Code      P  n  d    N = 10^4  10^5    10^6    10^7    10^8
    CPU-12x   s  4  0    144(3)    169(4)  182(5)  194(6)  210(7)
    GPU       s  4  0    192(3)    282(4)  376(5)  491(6)  629(7)
    CPU-12x   s  8  -1   221(2)    261(3)  275(4)  282(5)  286(6)
    GPU       s  8  -1   373(2)    705(3)  826(4)  852(5)  861(6)
    CPU-12x   d  4  0    35(3)     38(4)   39(5)   40(6)   NA
    GPU       d  4  0    69(3)     96(4)   124(5)  157(6)  NA
    CPU-12x   d  8  -1   33(2)     38(3)   40(4)   41(5)   NA
    GPU       d  8  -1   98(2)     207(3)  243(4)  249(5)  NA

Gflop/s in the P2P (direct) computation (tree depth in parentheses).

Conclusion
- Efficient and flexible fast multipole method: it can be used for any analytic kernel, smooth or oscillatory.
- A good fraction of peak performance can be achieved. There is still room for improvement (e.g., instruction throughput optimization).
- We have not considered the case of an adaptive tree with a varying number of levels.
Current work:
- Use StarPU to parallelize the method on multiple GPUs and CPUs.
- Overcome the tree-root bottleneck by dynamically scheduling the M2M, M2L, L2L, and P2P operators.
- Increase arithmetic intensity using symmetries.
The code, bbfmm, used for this work is available from http://sourceforge.net/projects/bbfmmgpu