The Fast Multipole Method on NVIDIA GPUs and Multicore Processors
Toru Takahashi (Department of Mechanical Science and Engineering, Nagoya University)
Cris Cecka (Institute for Applied Computational Science, Harvard University)
Eric Darve (Institute for Computational and Mathematical Engineering, Stanford University)
May 17, 2012
Takahashi, Cecka, Darve, FMM, 1 / 25
Outline
1. Fast multipole method
2. Multipole-to-local operator: blocking and data access pattern
3. Particle-particle interactions
4. Numerical results: multicore CPU and NVIDIA GPU
Fast Multipole Method
Applications:
- Gravitational simulations, molecular dynamics, electrostatic forces
- Integral equations for Laplace, Poisson, Stokes, and Navier equations; both fluid and solid mechanics
- Interpolation using radial basis functions: meshless methods, graphics (data smoothing)
- Imaging and inverse problems: subsurface characterization and monitoring, imaging
- O(N) fast linear algebra: fast matrix-vector and matrix-matrix products, matrix inverse, LU factorization, for dense and sparse matrices (FastLAPACK)
Matrix-vector products
FMM: a method to calculate

\phi(x_i) = \sum_{j=1}^{N} K(x_i, x_j)\, \sigma_j, \qquad 1 \le i \le N,

in O(N) or O(N \ln^\alpha N) arithmetic operations.
There are many different FMM formulations. Most of them rely on low-rank approximations of the kernel function K in the form:

K(x, y) = \sum_{q=1}^{p} \sum_{s=1}^{p} u_q(x)\, T_{qs}\, v_s(y) + \varepsilon
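For concreteness, here is a minimal C++ sketch (not from the authors' code) of the direct O(N^2) sum that the FMM accelerates. The 1D coordinates and the Laplace-like kernel 1/|x - y| are illustrative assumptions; the singular self-term is skipped.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Direct O(N^2) evaluation of phi(x_i) = sum_{j != i} K(x_i, x_j) * sigma_j
// for the kernel K(x, y) = 1/|x - y|, here in 1D for brevity.
std::vector<double> direct_sum(const std::vector<double>& x,
                               const std::vector<double>& sigma) {
    const std::size_t N = x.size();
    std::vector<double> phi(N, 0.0);
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            if (i != j)
                phi[i] += sigma[j] / std::fabs(x[i] - x[j]);
    return phi;
}
```

The FMM replaces the inner loop, for well-separated pairs, with the separated form above, which is where the O(N) complexity comes from.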
FMM low-rank approximation
Many formulas are possible to obtain a low-rank approximation. We use an approach based on interpolation formulas for smooth functions. This allows building an FMM that is:
- Very general: any smooth (even oscillatory) kernel can be used.
- Simple to apply: we only need a routine that evaluates K.
- Rapidly convergent, with optimal rank for a given L^2 error.
It is, however, less efficient than specialized methods for K(x, y) = 1/|x - y|.
Interpolation scheme
The low-rank approximation is obtained through a Chebyshev interpolation scheme:

K(x, y) \approx \sum_{l} \sum_{m} R_n(\bar{x}_l, x)\, K(\bar{x}_l, \bar{y}_m)\, R_n(\bar{y}_m, y)

[Figure: Chebyshev interpolation nodes \bar{x}_0, ..., \bar{x}_7 on the source and observation intervals]
W. Fong, E. Darve. The black-box fast multipole method. J. Comput. Phys., 228(23) (2009) 8712–8725.
FMM operators
[Figure: hierarchy of intervals with nodes \bar{x}_0, ..., \bar{x}_7 at each level; M2M transfers on the upward pass, L2L transfers on the downward pass]
- The M2M and L2L operators are transposes of each other.
- The operator depends only on the relative position of each cluster; it is independent of the location of the particles inside each cluster.
Data structures and well-separated clusters
- The low-rank approximation (1) is valid only for well-separated clusters (Figure 1: two spheres containing r_i and r_j illustrate the restricted domain of validity of (1)).
- Figure 2, 2D illustration: (a) a square is subdivided into four quadrants; the well-separated condition does not apply to any pair of quadrants. (b) Each quadrant is further subdivided into 4 quadrants; the interaction between particles in the two gray clusters can now be computed using the low-rank approximation (1). The interaction between the gray cluster and its neighboring clusters cannot be computed at this level of the tree; a further decomposition of each quadrant is required.
- This process is repeated recursively until the clusters are small enough that interactions between particles in neighboring clusters can be computed directly, that is, by direct evaluation of \sum_j K(r_i, r_j)\, \sigma_j.
- In a typical situation, a cluster is required to interact only with clusters that are neighbors (i.e., share a vertex, edge, or face) of its parent cluster. Since it cannot interact with clusters that are its own neighbors, each cluster interacts with a relatively small number of clusters: at most 189 (= 6^3 - 3^3) in 3D (Figure 3).
- Figure 4: illustration of the three passes in the FMM: upward pass (P2M, M2M), transfers between clusters, downward pass (L2L, L2P).
E. Darve, C. Cecka, and T. Takahashi. The fast multipole method on parallel clusters, multicore processors, and graphics processing units. Comptes Rendus Mécanique, 339(2–3):185–193, 2011.
Multipole-to-local (M2L) operator
- The operator K(\bar{x}_l, \bar{y}_m) needs to be applied many times. It is independent of the particle positions x_i; this allows certain optimizations to be computed off-line.
- A singular value decomposition is used to compress the operator K(\bar{x}_l, \bar{y}_m) to an optimal rank:

K_{IJ} = \Omega_o^{1/2}\, U \Sigma V^T\, \Omega_s^{1/2}

- Cost: 2rN instead of N^2.
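A sketch of why the compressed operator costs 2rN: a rank-r factorization K ~ U W is applied as two thin products. The generic factors U and W here are placeholders for the slide's weighted SVD factors, and the code is illustrative rather than the authors' implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Apply a rank-r factorization K ~ U * W (U: N x r, W: r x N) to a vector.
// Cost: r*N mul-adds for t = W*sigma plus N*r for phi = U*t, i.e. 2rN total,
// instead of N^2 for the dense product. Matrices are stored as rows.
std::vector<double> apply_low_rank(const std::vector<std::vector<double>>& U,
                                   const std::vector<std::vector<double>>& W,
                                   const std::vector<double>& sigma) {
    const std::size_t r = W.size(), N = sigma.size();
    std::vector<double> t(r, 0.0);           // t = W * sigma
    for (std::size_t q = 0; q < r; ++q)
        for (std::size_t j = 0; j < N; ++j)
            t[q] += W[q][j] * sigma[j];
    std::vector<double> phi(U.size(), 0.0);  // phi = U * t
    for (std::size_t i = 0; i < U.size(); ++i)
        for (std::size_t q = 0; q < r; ++q)
            phi[i] += U[i][q] * t[q];
    return phi;
}
```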
Convergence of Chebyshev interpolation and SVD compression
[Figure, left: relative L^2 error vs SVD cutoff for K = 1/r on a 2D prolate mesh, for interpolation orders n = 3, ..., 10]
[Figure, right: relative L^2 error vs target accuracy \varepsilon_{ACA} for K = e^{ikr}/r, for interpolation orders l = 4, ..., 13]
M. Messner, M. Schanz, and E. Darve. Fast directional multilevel summation for oscillatory kernels based on Chebyshev interpolation. J. Comput. Phys., 2011.
Multipole-to-local (M2L) operator as nested loops
Inside a given level, the M2L operator leads to a set of nested loops:

For all clusters C in this level {
  For all clusters C' in the interaction list of C {
    For q = 1 to p {
      For s = 1 to p {
        L(q)(C) += T(q,s)(C,C') * M(s)(C');
      }
    }
  }
}

C and C' are clusters that are well separated. The matrix T(q,s) is the M2L operator. The vector M holds the multipole coefficients and L the local coefficients. The core operation is therefore a matrix-vector product.
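The loop nest above, transcribed as a minimal C++ routine (a sketch, not the paper's implementation). For brevity a single operator T is shared by all pairs here, whereas in the method T depends on the relative position of the pair (C, C').

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // p x p, row-major

// M2L at one tree level: for every cluster C and every cluster Cp in its
// interaction list, accumulate L(C) += T * M(Cp). M holds multipole
// coefficients, L local coefficients; the core op is a matrix-vector product.
void m2l_level(const std::vector<std::vector<int>>& interaction_list,
               const Mat& T, const std::vector<Vec>& M, std::vector<Vec>& L) {
    const std::size_t p = T.size();
    for (std::size_t C = 0; C < interaction_list.size(); ++C)
        for (int Cp : interaction_list[C])
            for (std::size_t q = 0; q < p; ++q)
                for (std::size_t s = 0; s < p; ++s)
                    L[C][q] += T[q][s] * M[Cp][s];
}
```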
Blocking strategies
Four different strategies:
1. Basic approach: matrix-vector products.
2. Sibling blocking: reuse T(q,s)(C,C') for all pairs (C,C') that share the same T. This allows using matrix-matrix products instead of matrix-vector products: 64 MV products become 27 MM products.
[Figure: sibling source clusters S_0, ..., S_3 and observer clusters O_0, ..., O_3 sharing the same T(q,s)(C,C')]
3. Block siblings further into larger chunks. This is possible if the number of clusters is sufficient.
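The sibling-blocking idea of strategy 2 in miniature (a sketch, not the authors' code): multipole vectors that share the same transfer matrix are packed as columns of one matrix, so a single matrix-matrix product replaces several matrix-vector products and reuses every entry of T.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<double>>;  // rows of doubles

// Pack the multipole vectors of k clusters that share the same transfer
// matrix T as columns of Mblock (p x k). Then the k matrix-vector products
// collapse into one accumulated matrix-matrix product Lblock += T * Mblock,
// reusing each entry of T k times instead of reloading it per cluster.
void m2l_blocked(const Mat& T, const Mat& Mblock, Mat& Lblock) {
    const std::size_t p = T.size(), k = Mblock[0].size();
    for (std::size_t q = 0; q < p; ++q)
        for (std::size_t c = 0; c < k; ++c)
            for (std::size_t s = 0; s < p; ++s)
                Lblock[q][c] += T[q][s] * Mblock[s][c];
}
```

The reuse of T is what raises the arithmetic intensity in the later slides: the same operator entries feed k columns of output.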
Fourth blocking strategy
4. Move the q-s loop outside. This allows the greatest degree of reuse of T(q,s)(C,C') across all 316 pairs of clusters.
- Downside: the calculation is slightly more irregular due to the different treatment of each sibling.
- This only works at level 3 and below.
T. Takahashi, C. Cecka, and E. Darve. Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units. Int. J. Numer. Meth. Engng., 2011.
Arithmetic intensity

Algorithm variant | Arithmetic intensity
Algorithm 1 | 1.9–2.0 (memory bound)
Algorithm 2 | 4.2–4.3 (memory bound)
Algorithm 3 | 21–32 (arithmetic bound)
Algorithm 4 | 108–133 (arithmetic bound)

Platform and algorithm | Gflop/s
CPU | 125.2
GPU A1 | 72.0
GPU A2 | 146.9
GPU A3 | 257.3
GPU A4 | 358.8

Left: ratio of the number of flops performed by the kernel to the number of words read from memory. The arithmetic intensity should be around 30–35 for peak efficiency.
Right: performance of the M2L phase in single precision.
Technical details: medium accuracy, rank = 32, N = 10^7, K = 1/r, cube with random uniform distribution, Dell PowerEdge 1950 with two quad-core Intel Xeon E5345 2.33 GHz CPUs (OpenMP), Tesla C2050.
Comparison of algorithms on CPU (under development)

Alg. | P | n | d | N = 10^4 | 10^5 | 10^6 | 10^7 | 10^8
CPU-6x, Alg. 2 | s | 4 | 0 | 56(3) | 64(4) | 65(5) | 65(6) | 65(7)
CPU-6x, Alg. 4 | s | 4 | 0 | 5(3) | 12(4) | 22(5) | 25(6) | 27(7)
CPU-6x, Alg. 2 | s | 8 | -1 | 24(2) | 28(3) | 31(4) | 31(5) | 32(6)
CPU-6x, Alg. 4 | s | 8 | -1 | 1(2) | 5(3) | 13(4) | 22(5) | 27(6)

Gflop/s in the M2L computation (tree depth in parentheses); algorithms 2 and 4 on a 6-core Xeon X5680 3.33 GHz CPU.
P: precision [s|d]; n: parameter that determines the accuracy [4|8]; d: parameter to adjust the depth of the tree; N: number of particles.
Short-range interactions
- Assign one thread-block to one leaf box T, with I threads per thread-block.
- The I threads process I observation points in T at each iteration.
- For every source box S in N(T), all the threads load the data (x_j and \sigma_j) for J sources in S into shared memory.
- The loop over sources (the summation over j) is unrolled to improve performance.
- With this approach a high arithmetic intensity can be reached.
[Figure: a leaf box T with a group of I observation points x_i; each thread copies source data to shared memory and computes the (i, j) interactions with the J sources of charge \sigma_j at x_j in a source box S]
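A serial C++ analogue of the tiling scheme above (an assumed structure, not the authors' CUDA kernel): observation points are processed in tiles of I, one per thread on the GPU, and sources are staged in tiles of J, mirroring the shared-memory loads. A single box and a 1D kernel keep the sketch short.

```cpp
#include <cassert>
#include <cmath>
#include <algorithm>
#include <cstddef>
#include <vector>

// Tiled P2P sum for one box: phi(x_i) = sum_{j != i} sigma_j / |x_i - x_j|.
// The i-tile of size I maps to a thread-block's threads; the j-tile of size J
// maps to a shared-memory staging pass; the inner j-loop is what the GPU
// version unrolls.
std::vector<double> p2p_tiled(const std::vector<double>& x,
                              const std::vector<double>& sigma,
                              std::size_t I, std::size_t J) {
    const std::size_t N = x.size();
    std::vector<double> phi(N, 0.0);
    for (std::size_t i0 = 0; i0 < N; i0 += I) {        // tile of observers
        const std::size_t iEnd = std::min(i0 + I, N);
        for (std::size_t j0 = 0; j0 < N; j0 += J) {    // stage J sources
            const std::size_t jEnd = std::min(j0 + J, N);
            for (std::size_t i = i0; i < iEnd; ++i)
                for (std::size_t j = j0; j < jEnd; ++j)
                    if (i != j)
                        phi[i] += sigma[j] / std::fabs(x[i] - x[j]);
        }
    }
    return phi;
}
```

Each staged tile of J sources is reused by all I observers, which is exactly the data reuse that gives the P2P phase its high arithmetic intensity.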
Validation of both codes

Code | P | n | d | N = 10^4 | 10^5 | 10^6
CPU | s | 4 | 0 | 7.39e-4 | 6.65e-4 | 7.89e-4
GPU | s | 4 | 0 | 7.39e-4 | 6.65e-4 | 7.89e-4
CPU | s | 8 | -1 | 3.74e-6 | 8.05e-6 | 1.92e-5
GPU | s | 8 | -1 | 3.72e-6 | 8.07e-6 | 1.92e-5
CPU | d | 4 | 0 | 7.32e-4 | 6.67e-4 | 7.80e-4
GPU | d | 4 | 0 | 7.32e-4 | 6.67e-4 | 7.80e-4
CPU | d | 8 | -1 | 2.13e-7 | 2.25e-7 | 2.84e-7
GPU | d | 8 | -1 | 2.13e-7 | 2.25e-7 | 2.84e-7

Relative l^2 error of the CPU and GPU codes versus the O(N^2) code. K(x, y) = 1/|x - y|. Points are randomly uniformly distributed in a box.
Hardware used for tests

| Xeon X5680 (six-core, 3.33 GHz, two sockets) | Tesla C2050
# of cores in total | 12 | 448
Peak performance [Gflop/s] | 319.68 (single) / 159.84 (double) | 1030 (single) / 515 (double)
Bandwidth [GB/s] | 64 (32 per socket) | 115
Flop-to-word ratio | 20 | 36

Notes: peak performance is for MAD operations; hyper-threading is not used; ECC is enabled.
Benchmark results: serial CPU breakdown
[Figure: percentage of the total running time spent in the setup, P2M, M2M, M2L, L2L, L2P, and direct (P2P) phases of the CPU code, in single and double precision, for (n = 4, d = 0) and (n = 8, d = -1), as N varies from 10^4 to 10^8]
P = particle, M = multipole expansion, L = local expansion.
The direct (P2P for leaf clusters) and M2L phases occupy 60–98% of the total running time.
Benchmark results: CPU speedup using 12 cores (OpenMP), single precision
[Figure: speed-up (0–12) of the total, direct, L2P, L2L, M2M, P2M, and setup phases vs N = 10^4 to 10^8, for (n = 4, d = 0) and (n = 8, d = -1)]
Benchmark results: CPU speedup using 12 cores (OpenMP), double precision
[Figure: speed-up (0–12) of the total, direct, L2P, L2L, M2M, P2M, and setup phases vs N = 10^4 to 10^7, for (n = 4, d = 0) and (n = 8, d = -1)]
Speed-up of GPU vs CPU-12x
[Figure: GPU vs CPU-12x speed-up of the total and direct phases vs N, in single and double precision, for (n = 4, d = 0) and (n = 8, d = -1)]
Gflop/s performance: M2L computation

| P | n | d | N = 10^4 | 10^5 | 10^6 | 10^7 | 10^8
CPU-12x | s | 4 | 0 | 95(3) | 120(4) | 123(5) | 126(6) | 128(7)
GPU | s | 4 | 0 | 50(3) | 173(4) | 302(5) | 359(6) | 379(7)
CPU-12x | s | 8 | -1 | 35(2) | 39(3) | 41(4) | 43(5) | 46(6)
GPU | s | 8 | -1 | 63(2) | 65(3) | 192(4) | 319(5) | 375(6)
CPU-12x | d | 4 | 0 | 58(3) | 73(4) | 75(5) | 75(6) | NA
GPU | d | 4 | 0 | 26(3) | 83(4) | 140(5) | 165(6) | NA
CPU-12x | d | 8 | -1 | 17(2) | 18(3) | 19(4) | 20(5) | NA
GPU | d | 8 | -1 | 41(2) | 30(3) | 88(4) | 146(5) | NA

(Tree depth in parentheses.)
Gflop/s performance: P2P computation

| P | n | d | N = 10^4 | 10^5 | 10^6 | 10^7 | 10^8
CPU-12x | s | 4 | 0 | 144(3) | 169(4) | 182(5) | 194(6) | 210(7)
GPU | s | 4 | 0 | 192(3) | 282(4) | 376(5) | 491(6) | 629(7)
CPU-12x | s | 8 | -1 | 221(2) | 261(3) | 275(4) | 282(5) | 286(6)
GPU | s | 8 | -1 | 373(2) | 705(3) | 826(4) | 852(5) | 861(6)
CPU-12x | d | 4 | 0 | 35(3) | 38(4) | 39(5) | 40(6) | NA
GPU | d | 4 | 0 | 69(3) | 96(4) | 124(5) | 157(6) | NA
CPU-12x | d | 8 | -1 | 33(2) | 38(3) | 40(4) | 41(5) | NA
GPU | d | 8 | -1 | 98(2) | 207(3) | 243(4) | 249(5) | NA

(Tree depth in parentheses.)
Conclusion
- Efficient and flexible fast multipole method: it can be used for any analytic kernel, smooth or oscillatory.
- A good fraction of peak performance can be achieved. There is still room for improvement (e.g., instruction throughput optimization).
- We have not considered the case of an adaptive tree with a varying number of levels.
- Current work:
  - Use StarPU to parallelize the method on multiple GPUs and CPUs. Overcome the tree-root bottleneck by dynamically scheduling the M2M, M2L, L2L, and P2P operators.
  - Increase the arithmetic intensity using symmetries.
The code, bbfmm, used for this work is available from http://sourceforge.net/projects/bbfmmgpu