Fast Multipole Method on the GPU with application to the Adaptive Vortex Method University of Bristol, Bristol, United Kingdom. 1
Introduction. Particle methods are highly parallel and computationally intensive. Numerical challenge: the N-body problem. Opportunity: clever algorithms and massively parallel architectures (GPUs). Contribution: a mesh-less method, accelerated using a clever algorithm (the FMM) and implemented on GPUs. 2
Overview of the presentation Adaptive Vortex Method (brief introduction) Algorithmic representation The Fast Multipole Method Introduction to the algorithm GPU implementation Lessons learned Final remark 3
Vortex Method for fluid simulation 4
Vortex Method for fluid simulation. Incompressible Newtonian fluid (2D case):
$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u} = -\frac{\nabla p}{\rho} + \nu \nabla^2 \mathbf{u}$$
Navier-Stokes equation in vorticity formulation, with $\omega = \nabla\times\mathbf{u}$:
$$\frac{\partial \omega}{\partial t} + (\mathbf{u}\cdot\nabla)\omega = (\omega\cdot\nabla)\mathbf{u} + \nu \nabla^2 \omega$$ 5
Vortex Method for fluid simulation. Discretize the vorticity field into particles:
$$\omega_\sigma(\mathbf{x}, t) = \sum_{i=1}^{N} \gamma_i\, \zeta_\sigma(\mathbf{x} - \mathbf{x}_i)$$
Each particle carries vorticity through the Gaussian basis function
$$\zeta_\sigma(\mathbf{x}) = \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{|\mathbf{x}|^2}{2\sigma^2}\right)$$
Particles move with the fluid velocity $\mathbf{u}$:
$$\frac{d\mathbf{x}_i}{dt} = \mathbf{u}(\mathbf{x}_i, t)$$ 6
Vortex Method for fluid simulation. The velocity can be obtained from the vorticity field ($\omega = -\nabla^2\psi$) through the Biot-Savart integral:
$$\mathbf{u}(\mathbf{x}) = \frac{1}{2\pi} \int \frac{(\mathbf{x} - \mathbf{x}') \times \omega(\mathbf{x}')\,\hat{\mathbf{e}}_z}{|\mathbf{x} - \mathbf{x}'|^2}\, d\mathbf{x}'$$
where $\omega$ is given by the discretized vorticity field, which results in an N-body problem:
$$\mathbf{u}_\sigma(\mathbf{x}, t) = \sum_{i=1}^{N} \gamma_i\, \mathbf{K}_\sigma(\mathbf{x} - \mathbf{x}_i), \qquad \mathbf{K}_\sigma(\mathbf{x}) = \frac{1}{2\pi |\mathbf{x}|^2}\, (-x_2, x_1) \left(1 - \exp\!\left(-\frac{|\mathbf{x}|^2}{2\sigma^2}\right)\right)$$ 7
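As a concrete illustration of this N-body evaluation, the following CUDA sketch computes the direct O(N^2) sum with one thread per target particle. It is a minimal sketch rather than the talk's implementation; the kernel name directVelocity, the array layout, and the single-precision choice are assumptions.

// Minimal sketch (not the talk's code): direct O(N^2) evaluation of the
// regularised Biot-Savart sum on the GPU, one thread per target particle.
#include <cuda_runtime.h>

#define TWO_PI 6.283185307179586f

__global__ void directVelocity(int n,
                               const float *x, const float *y,   // particle positions
                               const float *gamma,               // circulation strengths
                               float sigma,                      // blob core size
                               float *ux, float *uy)             // output velocities
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float xi = x[i], yi = y[i];
    float ui = 0.0f, vi = 0.0f;

    for (int j = 0; j < n; ++j) {
        float dx = xi - x[j];
        float dy = yi - y[j];
        float r2 = dx * dx + dy * dy;
        if (r2 > 0.0f) {                         // skip the self-interaction
            // K_sigma(x) = (1 / (2*pi*|x|^2)) * (-x2, x1) * (1 - exp(-|x|^2 / (2*sigma^2)))
            float cutoff = 1.0f - expf(-r2 / (2.0f * sigma * sigma));
            float coef   = gamma[j] * cutoff / (TWO_PI * r2);
            ui += -dy * coef;
            vi +=  dx * coef;
        }
    }
    ux[i] = ui;
    uy[i] = vi;
}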
Vortex Method Algorithm 8
Vortex Method algorithm: 1. Discretization, 2. Velocity evaluation, 3. Convection, 4. Diffusion, 5. Spatial adaptation. (Cycle diagram: Start → steps 1-5 → End.) Step 1, discretization:
$$\omega(\mathbf{x}, t) \approx \omega_\sigma(\mathbf{x}, t) = \sum_{i=1}^{N} \Gamma_i(t)\, \zeta_{\sigma_i}(\mathbf{x} - \mathbf{x}_i(t))$$ 9
Vortex Method algorithm: 1. Discretization, 2. Velocity evaluation, 3. Convection, 4. Diffusion, 5. Spatial adaptation. Step 2, velocity evaluation:
$$\mathbf{u}_\sigma(\mathbf{x}, t) = \sum_{j=1}^{N} \Gamma_j\, \mathbf{K}_\sigma(\mathbf{x} - \mathbf{x}_j)$$ 10
Vortex Method algorithm: 1. Discretization, 2. Velocity evaluation, 3. Convection, 4. Diffusion, 5. Spatial adaptation. Step 3, convection:
$$\frac{d\mathbf{x}_i}{dt} = \mathbf{u}(\mathbf{x}_i, t)$$ 11
Vortex Method algorithm: 1. Discretization, 2. Velocity evaluation, 3. Convection, 4. Diffusion, 5. Spatial adaptation. Step 4, diffusion:
$$\frac{d\omega}{dt} = \nu \nabla^2 \omega$$ 12
Vortex Method algorithm: 1. Discretization, 2. Velocity evaluation, 3. Convection, 4. Diffusion, 5. Spatial adaptation. Step 5, spatial adaptation (the vorticity field is re-discretized onto a new particle set):
$$\omega(\mathbf{x}, t) \approx \omega_\sigma(\mathbf{x}, t) = \sum_{i=1}^{N} \Gamma_i(t)\, \zeta_{\sigma_i}(\mathbf{x} - \mathbf{x}_i(t))$$ 13
VM advantages: low numerical diffusion; no mesh, the discretization adapts to the fluid. VM challenges: efficient treatment of boundary conditions; numerical cost, the solution of an N-body problem. 14
Fast Multipole Method 15
Fast summation problem. Accelerate the evaluation of problems of the form
$$f(y_j) = \sum_{i=1}^{N} c_i\, K(y_j - x_i), \qquad j = 1, \dots, N$$
For N evaluation points the total amount of work is proportional to $N^2$. We want to solve this kind of problem in less than $O(N^2)$ work: an $O(N)$ and highly accurate algorithm. The FMM exchanges accuracy for speed, and we control the accuracy. 16
The Fast Multipole Method. The FMM is based on multipole expansions (MEs) to approximate the kernel function when it is evaluated far away from the expansion centre. An ME is an infinite series truncated after p terms; this is how we control the accuracy of the approximation:
$$K(y - x_c) = \sum_{m=0}^{p} a_m(x_c)\, f_m(y)$$
where the $a_m(x_c)$ are the coefficient terms. (Figure: particles $x_i$ clustered within radius r of the centre $x_c$, with the evaluation point y far away.) 17
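To make the truncated expansion concrete, here is a minimal host-side sketch, assuming the classic 2D log-kernel multipole expansion (Greengard and Rokhlin). The vortex velocity kernel used in the talk is related to this potential but not identical, and the names buildME/evalME are illustrative.

// Minimal sketch, assuming the 2D Laplace (log) kernel often used to present
// the FMM; the vortex velocity kernel is closely related (its gradient).
#include <complex>
#include <vector>
using cplx = std::complex<double>;

// P2M: build a p-term multipole expansion about the cluster centre zc.
std::vector<cplx> buildME(const std::vector<cplx>& z, const std::vector<double>& q,
                          cplx zc, int p)
{
    std::vector<cplx> a(p + 1, cplx(0.0, 0.0));
    for (size_t i = 0; i < z.size(); ++i) {
        a[0] += q[i];                                  // total "charge" of the cluster
        cplx dz = z[i] - zc, dzk = dz;
        for (int k = 1; k <= p; ++k) {                 // a_k = -sum_i q_i (z_i - zc)^k / k
            a[k] -= q[i] * dzk / double(k);
            dzk *= dz;
        }
    }
    return a;
}

// M2P: evaluate the truncated expansion at a far-away point y.
double evalME(const std::vector<cplx>& a, cplx zc, cplx y)
{
    cplx dz = y - zc;
    cplx phi = a[0] * std::log(dz);
    cplx dzk = dz;
    for (size_t k = 1; k < a.size(); ++k) {            // sum_k a_k / (y - zc)^k
        phi += a[k] / dzk;
        dzk *= dz;
    }
    return phi.real();                                  // potential of the whole cluster at y
}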
The Fast Multipole Method. The basic idea is to use the ME to approximate a cluster of particles as a single pseudo-particle. The larger the distance to a cluster, the larger the pseudo-particle can be. Direct evaluation is used for all particles in the near field. (Figure: pseudo-particles and particles as a function of distance from the evaluation point; domain decomposition.) 18
The Fast Multipole Method A Local Expansion (LE) is used to approximate the influence of a group of Multipole Expansions. An LE provides a local description of the influence of a particle that is located far away. Far field evaluation using a single Local Expansion. 19
The Fast Multipole Method. The computation is organized around the tree structure in the O(N) algorithm: the upward sweep creates the multipole expansions (P2M, M2M) and the downward sweep evaluates the local expansions (M2L, L2L, L2P), as outlined in the sketch below. 21
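A structural sketch of the two sweeps, with the stage names from this slide as empty stubs; the tree layout (a vector of cells per level, interaction lists starting at level 2) is an assumption, not the talk's data structure.

// Structural sketch of the O(N) FMM driver (not the talk's code). The stage
// kernels below are empty stubs whose names match the P2M/M2M/M2L/L2L/L2P
// labels; a real implementation fills them in.
#include <vector>

struct Cell { std::vector<Cell*> children; /* particle range, ME, LE, ... */ };

void p2m(Cell&) {}                          // particles -> multipole expansion (leaves)
void m2m(Cell&) {}                          // children MEs -> parent ME
void m2l(const Cell&, Cell&) {}             // well-separated ME -> local expansion
void l2l(Cell&) {}                          // parent LE shifted into this cell's LE
void l2p(Cell&) {}                          // LE -> far-field contribution at particles
void p2p(const Cell&, Cell&) {}             // near-field direct evaluation
std::vector<Cell*> interactionList(Cell&) { return {}; }
std::vector<Cell*> nearNeighbours(Cell&)  { return {}; }

void fmmSweeps(std::vector<std::vector<Cell>>& level)   // level[0] = root, level.back() = leaves
{
    int L = (int)level.size() - 1;

    // ---- Upward sweep: create multipole expansions ----
    for (Cell& leaf : level[L]) p2m(leaf);
    for (int l = L - 1; l >= 2; --l)
        for (Cell& c : level[l]) m2m(c);

    // ---- Downward sweep: evaluate local expansions ----
    for (int l = 2; l <= L; ++l)
        for (Cell& c : level[l]) {
            l2l(c);                                      // inherit the parent's LE
            for (Cell* s : interactionList(c)) m2l(*s, c);
        }
    for (Cell& leaf : level[L]) {
        l2p(leaf);                                       // far field
        for (Cell* nb : nearNeighbours(leaf)) p2p(*nb, leaf);  // near field
    }
}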
Fast Multipole Method on the GPU 22
Exposing task-level parallelism. A directed acyclic graph of the FMM shows the task dependencies and exposes task-level parallelism across four phases: setup, upward sweep, downward sweep, and evaluation. Stages: 1. Tree creation. 2. Particle clustering. 3. Listing of cluster interactions. 4. Particle to Multipole (P2M). 5. Multipole to Multipole (M2M). 6. Multipole to Local (M2L). 7. Local to Local (L2L). 8. Local to Particle (L2P). 9. Near-field evaluation. 10. Adding near- and far-field contributions. 23
FMM: computational time per stage. The downward sweep (M2L) and the particle evaluation account for over 99% of the time, so these two stages offer the biggest opportunities for gains. The particle evaluation is easy to implement on the GPU, so we focus on the Multipole-to-Local operations (M2L). (Plot: time [sec] for ME initialization, upward sweep, downward sweep, evaluation, and total time versus number of processors, 2 to 256; parallel FMM (PetFMM), 10 million particles, FMM level 9, 17 FMM terms.) 24
Accelerating the M2L. The M2L stage can take over 99% of the computation time. One LE is formed from several transformed MEs; in total many LEs are produced, but only one per cluster (level L=5 requires 27,648 M2L translations). Each M2L transformation acts as a matrix-vector operator, so the M2L implementation is matrix-free and computationally intensive, as in the sketch below. (Figure: MEs (orange) used to produce a single LE (blue) through the M2L transformation.) 25
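A sketch of one matrix-free M2L translation, assuming the 2D log-kernel expansion from the earlier sketch (Greengard and Rokhlin's conversion formula); the talk's actual operator and truncation may differ, and the helper names are illustrative. The translation coefficients are rebuilt on the fly from the centre offset instead of being stored as a matrix.

// One M2L translation, written matrix-free: the (p+1)x(p+1) operator is
// regenerated from the offset z0 = zME - zLE rather than stored.
#include <complex>
#include <vector>
using cplx = std::complex<double>;

static double binom(int n, int k)            // small helper: binomial coefficient
{
    double r = 1.0;
    for (int i = 1; i <= k; ++i) r *= double(n - k + i) / double(i);
    return r;
}

// Accumulate into the local expansion b (about zLE) the contribution of the
// multipole expansion a (about zME); z0 is the ME centre seen from the LE centre.
void m2lLogKernel(const std::vector<cplx>& a, std::vector<cplx>& b, cplx z0)
{
    int p = (int)a.size() - 1;

    std::vector<cplx> invz0k(p + 1);          // powers of 1/z0, reused for every term
    invz0k[0] = cplx(1.0, 0.0);
    for (int k = 1; k <= p; ++k) invz0k[k] = invz0k[k - 1] / z0;

    // b_0 = a_0 log(-z0) + sum_k a_k (-1)^k / z0^k
    cplx b0 = a[0] * std::log(-z0);
    for (int k = 1; k <= p; ++k) b0 += a[k] * invz0k[k] * ((k % 2) ? -1.0 : 1.0);
    b[0] += b0;

    // b_l = -a_0 / (l z0^l) + sum_k a_k (-1)^k binom(l+k-1, k-1) / z0^(l+k)
    for (int l = 1; l <= p; ++l) {
        cplx bl = -a[0] / double(l) * invz0k[l];
        for (int k = 1; k <= p; ++k)
            bl += a[k] * invz0k[k] * invz0k[l] * binom(l + k - 1, k - 1) * ((k % 2) ? -1.0 : 1.0);
        b[l] += bl;
    }
}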
Accelerating the M2L. Work reorganization: from the hierarchical tree structure to a queue, giving homogeneous units of work and improved temporal locality. (Figure: the upward/downward sweep of the tree with the P2M, M2M, M2L, L2L, and L2P operators.) 26
Accelerating the M2L. Work reorganization: from the hierarchical tree structure to a queue, giving homogeneous units of work and improved temporal locality. The computations are reorganized into a task queue of M2L jobs grouped by target cell, e.g. M2L(A, c1), M2L(A, c2), M2L(A, c3), M2L(B, c1), M2L(B, c2), M2L(B, c3); a sketch of this flattening follows. 27
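A minimal host-side sketch of the reorganization: the per-cell interaction lists are flattened into one queue of homogeneous M2L jobs grouped by target cell. The structures M2LJob and buildQueue are illustrative, not the talk's code, and the interaction lists stand in for the real tree.

#include <vector>
#include <algorithm>

// Flatten the hierarchical interaction lists into one homogeneous work queue,
// grouped by target cell so all transforms feeding the same LE are adjacent
// (better temporal locality, and a natural unit of work for the GPU).
struct M2LJob { int target; int source; };     // indices of the LE cell and the ME cell

std::vector<M2LJob> buildQueue(int numCells,
                               const std::vector<std::vector<int>>& interactionList)
{
    std::vector<M2LJob> queue;
    for (int t = 0; t < numCells; ++t)                  // one pass per target cell
        for (int s : interactionList[t])                // its well-separated source cells
            queue.push_back({t, s});
    // The queue is already grouped by target; a stable sort keeps that grouping
    // explicit if the lists were produced in a different order.
    std::stable_sort(queue.begin(), queue.end(),
                     [](const M2LJob& a, const M2LJob& b) { return a.target < b.target; });
    return queue;
}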
GPU kernel version 1: each thread transforms one ME (see the sketch below). Matrix-free multiplication with efficient matrix creation and multiplication; no thread synchronization is required. Drawbacks: resource-intensive threads and non-coalesced memory transactions. (Figure: single-thread computation pattern, ME to LE.) Result: 20 Giga-operations (one C1060 card), a 20x speedup. 28
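A CUDA sketch of this computation pattern, one thread per M2L job; the translation coefficient entry() is a deliberately simplified, real-valued stand-in (the real coefficients are complex), so only the parallelization and memory-access pattern should be read from it. The job and coefficient layouts are assumptions.

#include <cuda_runtime.h>

#define P 16                                   // truncation: P+1 terms per expansion

struct Job { int target; int source; float tx, ty; };   // LE cell, ME cell, centre offset

__device__ float entry(int l, int m, float tx, float ty)
{
    // Stand-in translation coefficient, purely to make the sketch self-contained;
    // a real FMM uses the kernel-specific (complex-valued) formula.
    float r2 = tx * tx + ty * ty;
    return 1.0f / powf(r2, 0.5f * (l + m + 1));
}

// Kernel "version 1": one thread owns one whole M2L transform and rebuilds the
// (P+1)x(P+1) operator on the fly. Several jobs may feed the same LE, hence the
// atomicAdd; reads of the source ME are strided and therefore non-coalesced.
__global__ void m2lKernelV1(int numJobs, const Job* jobs,
                            const float* me,   // numCells * (P+1) multipole coefficients
                            float* le)         // numCells * (P+1) local coefficients
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= numJobs) return;

    Job job = jobs[j];
    const float* src = me + job.source * (P + 1);

    for (int l = 0; l <= P; ++l) {
        float acc = 0.0f;
        for (int m = 0; m <= P; ++m)
            acc += entry(l, m, job.tx, job.ty) * src[m];
        atomicAdd(&le[job.target * (P + 1) + l], acc);
    }
}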
GPU kernel version 2: many threads transform one ME, each thread computing only one term (see the sketch below). Less efficient in floating-point operations, but more parallelism, coalesced memory transactions, fewer resources per thread, and other memory tricks. (Figure: multiple-thread computation pattern, ME to LE.) Result: 482 Giga-operations (one C1060 card), a 100x speedup. 29
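A CUDA sketch of the second pattern, one thread block per target LE and one thread per output term, with each source ME staged in shared memory; as before, entry() is a simplified stand-in and the job layout (jobs pre-grouped per target, firstJob offsets) is an assumption, not the talk's code.

#include <cuda_runtime.h>

#define P 16

struct Job { int source; float tx, ty; };      // jobs are pre-grouped per target LE

__device__ float entry(int l, int m, float tx, float ty)
{
    float r2 = tx * tx + ty * ty;              // illustrative stand-in, see kernel version 1
    return 1.0f / powf(r2, 0.5f * (l + m + 1));
}

// Kernel "version 2": one block owns one LE, one thread owns one output term.
// Launch with blockDim.x >= P+1, e.g. m2lKernelV2<<<numTargetCells, 32>>>(...).
__global__ void m2lKernelV2(const Job* jobs,
                            const int* firstJob,  // firstJob[b]..firstJob[b+1): jobs of LE b
                            const float* me,      // numCells * (P+1) multipole coefficients
                            float* le)            // numCells * (P+1) local coefficients
{
    __shared__ float sme[P + 1];
    int target = blockIdx.x;                   // this block owns one local expansion
    int l = threadIdx.x;                       // this thread owns one output term
    float acc = 0.0f;

    for (int j = firstJob[target]; j < firstJob[target + 1]; ++j) {
        Job job = jobs[j];
        if (l <= P) sme[l] = me[job.source * (P + 1) + l];   // coalesced staging of the ME
        __syncthreads();
        if (l <= P)
            for (int m = 0; m <= P; ++m)
                acc += entry(l, m, job.tx, job.ty) * sme[m]; // reduce over all source MEs
        __syncthreads();
    }
    if (l <= P) le[target * (P + 1) + l] = acc;              // coalesced, atomic-free store
}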
Lessons Learned 30
Paradigm shift. Start by exposing parallelism: think about homogeneous units of work, about thousands of parallel operations, and about smart usage of resources. Trade operation efficiency for more parallel and resource-efficient kernels. Think about heterogeneous computing: GPUs are not a silver bullet; use the CPU to reorganize work. 31
Conclusions. Heterogeneous computing: use all available hardware! Current FMM peak: 480 giga-ops. Methodology: identify and expose parallelism, distribute work between CPU and GPU, and use the best hardware for each job! Current work: a parallel FMM library (many applications) and a multi-GPU implementation of the FMM. 32
Ongoing work. Particle methods map well to new architectures. However, particle methods have the disadvantage of not being as mature as mesh-based methods: much more research has been done for conventional mesh methods. Ongoing work: a compromise between the two approaches, hybrid particle-mesh methods on new architectures. 33
Final remark. Novel architectures versus current applications: how do we cross the bridge between new technologies and current applications? Re-developing algorithms can give large speedups but is far from trivial; porting algorithms can give small speedups with less effort. A cost-effective solution: research and development of heterogeneity-aware libraries. 34
Thanks for listening 35
Velocity calculation: Gaussian particles, N-body problem.
$$\zeta_\sigma(\mathbf{x}) = \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{|\mathbf{x}|^2}{2\sigma^2}\right)$$
Vorticity:
$$\omega_\sigma(\mathbf{x}, t) = \sum_{i=1}^{N} \gamma_i\, \zeta_\sigma(\mathbf{x} - \mathbf{x}_i)$$
Velocity:
$$\mathbf{u}_\sigma(\mathbf{x}, t) = \sum_{i=1}^{N} \gamma_i\, \mathbf{K}_\sigma(\mathbf{x} - \mathbf{x}_i), \qquad \mathbf{K}_\sigma(\mathbf{x}) = \frac{1}{2\pi |\mathbf{x}|^2}\, (-x_2, x_1) \left(1 - \exp\!\left(-\frac{|\mathbf{x}|^2}{2\sigma^2}\right)\right)$$ 36
Vortex sheet. A discontinuity in the velocity field, represented by vortex elements. The sheet strength $\gamma(s)$ satisfies the boundary integral equation
$$\gamma(s) - \frac{1}{\pi} \int_{\partial B} \left[\frac{\partial}{\partial n} \log|\mathbf{x}(s) - \mathbf{x}(s')| - \frac{\rho_1(s)}{L}\right] \gamma(s')\, ds' = 2\, \mathbf{u}_{\mathrm{slip}} \cdot \hat{\mathbf{s}}$$
and is diffused into the fluid by solving
$$\frac{\partial\omega}{\partial t} - \nu \nabla^2 \omega = 0, \qquad \omega(t - \delta t) = 0, \qquad \nu \frac{\partial\omega}{\partial n} = \frac{\gamma(s)}{\delta t}$$ 37
Vortex Method algorithm: 1. Discretization, 2. Velocity evaluation, 3. Convection, 4. Diffusion, 5. Spatial adaptation. (Cycle diagram: Start → steps 1-5 → End.)
Vortex method algorithm with panel-free boundary conditions: the cycle gains two extra stages, A. Vortex sheet calculation and B. Vortex sheet diffusion. (Cycle diagram: Start → steps 1-5 with stages A and B inserted → End.)
Panel-free method. Discretize the boundary into points (particle discretization); the points are the control points, the boundary conditions are enforced at the control points, and the solution is obtained with RBFs. 42
Panel-free method. Discretize the boundary into points (particle discretization); the points are the control points, the boundary conditions are enforced at the control points, and the solution is obtained with RBFs:
$$\gamma(\mathbf{x}) \approx \sum_{i=1}^{N} \phi(|\mathbf{x} - \mathbf{c}_i|)\, \alpha_i$$
A sketch of this collocation solve follows.
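A minimal sketch of the RBF collocation step: assemble the dense system A alpha = gamma with A_ij = phi(|c_i - c_j|) and solve it. The Gaussian basis function and the plain Gaussian-elimination solver are illustrative choices, not necessarily the ones used in the talk.

#include <vector>
#include <cmath>
#include <utility>

struct Point { double x, y; };

double phi(double r, double eps = 1.0)                 // Gaussian RBF (assumed shape)
{
    return std::exp(-(eps * r) * (eps * r));
}

// Solve for the RBF weights alpha so that the sheet strength gamma is matched
// at the control points c_i: sum_j phi(|c_i - c_j|) alpha_j = gamma_i.
std::vector<double> solveSheetStrength(const std::vector<Point>& c,
                                       std::vector<double> gamma)
{
    int n = (int)c.size();
    std::vector<double> A(n * n);
    for (int i = 0; i < n; ++i)                        // dense collocation matrix
        for (int j = 0; j < n; ++j) {
            double dx = c[i].x - c[j].x, dy = c[i].y - c[j].y;
            A[i * n + j] = phi(std::sqrt(dx * dx + dy * dy));
        }

    // Gaussian elimination with partial pivoting (any dense solver would do).
    for (int k = 0; k < n; ++k) {
        int piv = k;
        for (int i = k + 1; i < n; ++i)
            if (std::fabs(A[i * n + k]) > std::fabs(A[piv * n + k])) piv = i;
        for (int j = 0; j < n; ++j) std::swap(A[k * n + j], A[piv * n + j]);
        std::swap(gamma[k], gamma[piv]);
        for (int i = k + 1; i < n; ++i) {
            double f = A[i * n + k] / A[k * n + k];
            for (int j = k; j < n; ++j) A[i * n + j] -= f * A[k * n + j];
            gamma[i] -= f * gamma[k];
        }
    }
    std::vector<double> alpha(n);
    for (int i = n - 1; i >= 0; --i) {                 // back substitution
        double s = gamma[i];
        for (int j = i + 1; j < n; ++j) s -= A[i * n + j] * alpha[j];
        alpha[i] = s / A[i * n + i];
    }
    return alpha;                                      // gamma(x) ~ sum_i phi(|x - c_i|) alpha[i]
}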
Accelerating the M2L. The M2L is a two-stage computation: Stage 1, transformation of the MEs; Stage 2, reduction into the LE. (Figure: several MEs transformed and reduced into a single LE.) 44
PetFMM: a parallel, extensible toolkit for the FMM. (Figure: parallelization strategy, with a root tree down to level k and sub-trees 1 to 8 assigned to local domains; M2M and L2L translations and the M2L transformation are indicated.) 45
PetFMM: a parallel, extensible toolkit for the FMM. (Figure: parallel work distribution as a weighted graph, with vertex weights $w_i$, $w_j$ and edge weights $c_{ij}$.) 46
PetFMM: a parallel, extensible toolkit for the FMM. (Plot: speedup of PetFMM versus number of processors, 2 to 256, against perfect speedup, for different test cases: uniform 4ML8R5, uniform 10ML9R5, spiral 1ML8R5, and spiral w/ space-filling 1ML8R5.) 47