Fast Multipole Method on the GPU


Fast Multipole Method on the GPU, with application to the Adaptive Vortex Method. University of Bristol, Bristol, United Kingdom. 1

Introduction. Particle methods are highly parallel and computationally intensive. Numerical challenge: the N-body problem. Opportunity: clever algorithms and massively parallel architectures (GPUs). Contribution: a meshless method, accelerated using clever algorithms (FMM) and implemented for GPUs. 2

Overview of the presentation: the Adaptive Vortex Method (brief introduction, algorithmic representation); the Fast Multipole Method (introduction to the algorithm, GPU implementation); lessons learned; final remark. 3

Vortex Method for fluid simulation 4

Vortex Method for fluid simulation. Incompressible Newtonian fluid (2D case): $\frac{\partial \mathbf{u}}{\partial t} + \mathbf{u}\cdot\nabla\mathbf{u} = -\frac{\nabla p}{\rho} + \nu\nabla^2\mathbf{u}$. Navier-Stokes equation in vorticity formulation, with $\omega = \nabla\times\mathbf{u}$: $\frac{\partial\omega}{\partial t} + \mathbf{u}\cdot\nabla\omega = \omega\cdot\nabla\mathbf{u} + \nu\nabla^2\omega$. 5

Vortex Method for fluid simulation. Discretize the vorticity field into particles: $\omega_\sigma(\mathbf{x},t) = \sum_{i=1}^{N}\gamma_i\,\zeta_\sigma(\mathbf{x}-\mathbf{x}_i)$. Each particle carries vorticity: $\zeta_\sigma(\mathbf{x}) = \frac{1}{2\pi\sigma^2}\exp\left(-\frac{|\mathbf{x}|^2}{2\sigma^2}\right)$. Particles move with the fluid: $\frac{d\mathbf{x}_i}{dt} = \mathbf{u}(\mathbf{x}_i,t)$. 6

Vortex Method for fluid simulation. The velocity can be obtained from the vorticity field: $\omega = -\nabla^2\psi$, so $\mathbf{u}(\mathbf{x}) = -\frac{1}{2\pi}\int\frac{(\mathbf{x}-\mathbf{x}')\times\omega(\mathbf{x}')\hat{\mathbf{e}}_z}{|\mathbf{x}-\mathbf{x}'|^2}\,d\mathbf{x}'$, where $\omega$ is given by the discretized vorticity field, which results in an N-body problem: $\mathbf{u}_\sigma(\mathbf{x},t) = \sum_{i=1}^{N}\gamma_i\,\mathbf{K}_\sigma(\mathbf{x}-\mathbf{x}_i)$, with $\mathbf{K}_\sigma(\mathbf{x}) = \frac{1}{2\pi|\mathbf{x}|^2}(-x_2,\,x_1)\left(1-\exp\left(-\frac{|\mathbf{x}|^2}{2\sigma^2}\right)\right)$. 7
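
To make the O(N^2) cost of this sum concrete, here is a minimal CUDA sketch of the direct evaluation (not the talk's code; names and data layout are illustrative): one thread per evaluation point, looping over all source particles with the regularized kernel K_sigma.

```cuda
// Minimal sketch (illustrative): direct O(N^2) evaluation of the regularized
// Biot-Savart sum u_sigma(x_j) = sum_i gamma_i * K_sigma(x_j - x_i),
// one CUDA thread per evaluation point.
#include <cuda_runtime.h>

__global__ void directVelocity(int n, const float2* pos, const float* gamma,
                               float sigma, float2* vel)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;

    const float2 xj = pos[j];
    float2 u = make_float2(0.0f, 0.0f);

    for (int i = 0; i < n; ++i) {
        float dx = xj.x - pos[i].x;
        float dy = xj.y - pos[i].y;
        float r2 = dx * dx + dy * dy;
        if (r2 == 0.0f) continue;                      // skip the self-interaction
        // K_sigma = 1/(2*pi*r^2) * (-dy, dx) * (1 - exp(-r^2 / (2*sigma^2)))
        float g = (1.0f - expf(-r2 / (2.0f * sigma * sigma)))
                  / (2.0f * 3.14159265f * r2);
        u.x += -dy * g * gamma[i];
        u.y +=  dx * g * gamma[i];
    }
    vel[j] = u;
}
```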

Vortex Method Algorithm 8

Vortex Method algorithm: 1. Discretization, 2. Velocity evaluation, 3. Convection, 4. Diffusion, 5. Spatial adaptation. [Flow chart: Start, steps 1-5 in a loop, End.] Step 1, discretization: $\omega(\mathbf{x},t)\approx\omega_\sigma(\mathbf{x},t)=\sum_{i=1}^{N}\Gamma_i(t)\,\zeta_{\sigma_i}(\mathbf{x}-\mathbf{x}_i(t))$. 9

Vortex Method algorithm (steps 1-5 as above). Step 2, velocity evaluation: $\mathbf{u}_\sigma(\mathbf{x},t)=\sum_{j=1}^{N}\Gamma_j\,\mathbf{K}_\sigma(\mathbf{x}-\mathbf{x}_j)$. 10

Vortex Method algorithm (steps 1-5 as above). Step 3, convection: $\frac{d\mathbf{x}_i}{dt}=\mathbf{u}(\mathbf{x}_i,t)$. 11

Vortex Method algorithm (steps 1-5 as above). Step 4, diffusion: $\frac{d\omega}{dt}=\nu\nabla^2\omega$. 12

Vortex Method algorithm (steps 1-5 as above). Step 5, spatial adaptation: $\omega(\mathbf{x},t)\approx\omega_\sigma(\mathbf{x},t)=\sum_{i=1}^{N}\Gamma_i(t)\,\zeta_{\sigma_i}(\mathbf{x}-\mathbf{x}_i(t))$. 13
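
A minimal host-side sketch of one pass through this loop is given below (illustrative, not the talk's implementation): convection by forward Euler using the velocities from step 2, and diffusion handled here by core spreading (growing each sigma_i^2 by 2*nu*dt), which is one common choice assumed only for this example; spatial adaptation is left as a stub.

```cuda
// One vortex-method time step (host-side sketch; scheme choices are assumptions).
#include <vector>

struct Particles {
    std::vector<float> x, y;       // positions
    std::vector<float> gamma;      // circulation strengths
    std::vector<float> sigma2;     // squared core sizes
};

void timeStep(Particles& p, const std::vector<float>& ux,
              const std::vector<float>& uy, float nu, float dt)
{
    for (size_t i = 0; i < p.x.size(); ++i) {
        p.x[i] += dt * ux[i];              // 3. convection: dx_i/dt = u(x_i, t)
        p.y[i] += dt * uy[i];
        p.sigma2[i] += 2.0f * nu * dt;     // 4. diffusion via core spreading (assumed)
    }
    // 5. spatial adaptation (re-initializing the particle field) omitted here.
}
```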

VM advantages: low numerical diffusion; no mesh; it adapts to the fluid. VM challenges: efficient treatment of boundary conditions; numerically, the solution of an N-body problem. 14

Fast Multipole Method 15

Fast summation problem. Accelerate the evaluation of problems of the form $f(y_j) = \sum_{i=1}^{N} c_i\,K(y_j - x_i)$, for $j = 1,\dots,N$. For N evaluation points the total amount of work is proportional to $N^2$. We want to solve this kind of problem in less than $O(N^2)$: we want an $O(N)$ and highly accurate algorithm. The FMM exchanges accuracy for speed, and we control the accuracy. 16

The Fast Multipole Method. The FMM is based on multipole expansions (MEs) to approximate the kernel function when it is evaluated far away from the origin. An ME is an infinite series truncated after p terms; this is how we control the accuracy of the approximation: $K(y - x_c) = \sum_{m=0}^{p} a_m(x_c)\,f_m(y)$, where the $a_m(x_c)$ are the coefficient terms. [Diagram: a cluster of particles $x_i$ of radius $r$ around a center $x_c$, and an evaluation point $y$ far from the cluster.] 17
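
To make the truncated expansion concrete, here is a small host-side sketch; as an assumption for illustration it uses the classic 2D log-kernel multipole expansion in complex form (the talk does not state its expansion basis). particleToMultipole builds the p+1 coefficients of a cluster once (P2M), and evaluateMultipole approximates the far field at any point well separated from the cluster.

```cuda
// Illustrative sketch only: a truncated multipole expansion for the 2D log kernel.
// Particles with strengths gamma_i at z_i, cluster center z_c; the coefficients
// a_m are computed once and reused for any far evaluation point z.
#include <complex>
#include <vector>

using complexf = std::complex<float>;

// P2M: a_0 = sum_i gamma_i,  a_m = -sum_i gamma_i (z_i - z_c)^m / m,  m = 1..p
std::vector<complexf> particleToMultipole(const std::vector<complexf>& z,
                                          const std::vector<float>& gamma,
                                          complexf zc, int p)
{
    std::vector<complexf> a(p + 1, complexf(0.0f, 0.0f));
    for (size_t i = 0; i < z.size(); ++i) {
        complexf dz = z[i] - zc, pw = dz;
        a[0] += gamma[i];
        for (int m = 1; m <= p; ++m) { a[m] -= gamma[i] * pw / float(m); pw *= dz; }
    }
    return a;
}

// Far-field evaluation: phi(z) ~ a_0 log(z - z_c) + sum_{m=1}^p a_m / (z - z_c)^m
complexf evaluateMultipole(const std::vector<complexf>& a, complexf zc, complexf z)
{
    complexf dz = z - zc, inv = 1.0f / dz, pw = inv;
    complexf phi = a[0] * std::log(dz);
    for (size_t m = 1; m < a.size(); ++m) { phi += a[m] * pw; pw *= inv; }
    return phi;
}
```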

The Fast Multipole Method. The basic idea is to use this ME to approximate a cluster of particles as a single pseudo-particle. The bigger the distance to the cluster, the bigger the pseudo-particles can be. Direct evaluation is used for all particles in the near field. [Diagram: domain decomposition, with pseudo-particles growing with distance from the evaluation point.] 18

The Fast Multipole Method A Local Expansion (LE) is used to approximate the influence of a group of Multipole Expansions. An LE provides a local description of the influence of a particle that is located far away. Far field evaluation using a single Local Expansion. 19

The Fast Multipole Method. The computation related to the tree structure in the O(N) algorithm: an upward sweep that creates the multipole expansions (P2M, M2M), and a downward sweep that builds and evaluates the local expansions (M2L, L2L, L2P). 21
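
A schematic host-side sketch of the two sweeps is shown below (the cell layout and operator stubs are illustrative, not the speaker's data structures): the upward sweep builds multipole expansions bottom-up, and the downward sweep translates them into local expansions and pushes those down to the particles.

```cuda
// Skeleton of the O(N) FMM tree traversal (illustrative; operators left as stubs).
#include <vector>

struct Cell {
    std::vector<Cell*> children;         // empty for leaf cells
    std::vector<Cell*> interactionList;  // well-separated cells on the same level
    std::vector<float> M, L;             // multipole / local expansion coefficients
};

void P2M(Cell*)        {}  // particles -> multipole expansion (omitted)
void M2M(Cell*, Cell*) {}  // child multipole -> parent multipole (omitted)
void M2L(Cell*, Cell*) {}  // source multipole -> target local (omitted)
void L2L(Cell*, Cell*) {}  // parent local -> child local (omitted)
void L2P(Cell*)        {}  // local expansion -> particle velocities (omitted)

void upwardSweep(Cell* c) {
    if (c->children.empty()) { P2M(c); return; }
    for (Cell* child : c->children) { upwardSweep(child); M2M(child, c); }
}

void downwardSweep(Cell* c) {
    for (Cell* source : c->interactionList) M2L(source, c);
    for (Cell* child : c->children) { L2L(c, child); downwardSweep(child); }
    if (c->children.empty()) L2P(c);
}
```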

Fast Multipole Method on the GPU 22

Exposing task-level parallelism. [Directed acyclic graph of the FMM, grouping the stages into Setup, Upward Sweep, Downward Sweep, and Evaluation; it shows the task dependencies and exposes task-level parallelism.] Stages: 1. Tree creation. 2. Particle clustering. 3. Listing of cluster interactions. 4. Particle to Multipole. 5. Multipole to Multipole. 6. Multipole to Local. 7. Local to Local. 8. Local to Particle. 9. Near-field evaluation. 10. Adding near- and far-field contributions. 23

FMM: computational time per stage. [Plot: time in seconds per stage (ME initialization, upward sweep, downward sweep, evaluation, total) versus number of processors, 2 to 256, for the parallel FMM (PetFMM) with 10 million particles, FMM level 9, 17 terms.] The downward sweep (M2L) and the particle evaluation account for over 99% of the time, so these two stages offer the big gains. Particle evaluation is easy to implement for the GPU; we therefore focus on the Multipole-to-Local operations (M2L). 24

Accelerating the M2L. The M2L stage can take over 99% of the computation time. One LE is formed from several transformed MEs; in total many LEs are produced, but only one per cluster (L=5 requires 27,648 M2L translations). The M2L transformation acts as a matrix-vector operator, and the M2L implementation is matrix-free and computationally intensive. [Diagram: the MEs (orange) used to produce a single LE (blue) via the M2L transformation.] 25

Accelerating the M2L. Work reorganization: from the hierarchical structure to a queue; homogeneous units of work; improved temporal locality. [Diagram: upward sweep (create multipole expansions: P2M, M2M) and downward sweep (evaluate local expansions: M2L, L2L, L2P).] 26

Accelerating the M2L. Work reorganization: from the hierarchical structure to a queue; homogeneous units of work; improved temporal locality. [Diagram: the M2L tasks of the tree sweeps are reorganized into a task queue: M2L(A, c1), M2L(A, c2), M2L(A, c3), M2L(B, c1), M2L(B, c2), M2L(B, c3).] 27
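
A minimal sketch of this reorganization (illustrative; the talk does not show its data structures): the per-cell interaction lists of the tree are flattened on the CPU into one homogeneous queue of M2L tasks, grouped by target cell so that consecutive tasks reuse the same local expansion.

```cuda
// Flatten the hierarchical interaction lists into a homogeneous M2L task queue.
#include <vector>
#include <algorithm>

struct M2LTask { int sourceCell; int targetCell; };  // one ME -> LE translation

std::vector<M2LTask> buildM2LQueue(const std::vector<std::vector<int>>& interactionList)
{
    std::vector<M2LTask> queue;
    for (int target = 0; target < (int)interactionList.size(); ++target)
        for (int source : interactionList[target])
            queue.push_back({source, target});
    // Group tasks by target cell to improve temporal locality on the LE data.
    std::sort(queue.begin(), queue.end(),
              [](const M2LTask& a, const M2LTask& b) { return a.targetCell < b.targetCell; });
    return queue;
}
```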

GPU kernel version 1. Each thread transforms one ME: matrix-free multiplication, efficient matrix creation and multiplication, and no thread synchronization required. Drawbacks: resource-intensive threads and non-coalesced memory transactions. [Diagram: single-thread computation pattern, one ME mapped to one LE per thread.] Result: 20 giga-operations (on one C1060 card), a 20x speedup. 28
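
A sketch of this first mapping is below (illustrative, not the original implementation): one thread pulls one task from the queue and carries out the whole M2L translation, producing all p+1 local-expansion terms itself. The written-out operator is the standard 2D log-kernel M2L, used here only as a stand-in for whatever expansion the actual code uses; each task writes to its own output slot, so no synchronization is needed and a later reduction pass can sum the contributions per target cell.

```cuda
// Kernel version 1 sketch: one thread per M2L task (stand-in 2D log-kernel operator).
#include <thrust/complex.h>

using complexf = thrust::complex<float>;
constexpr int P = 16;                       // truncation order (illustrative)

struct M2LTask { int sourceCell; int targetCell; };

__global__ void m2lOneThreadPerTask(int nTasks, const M2LTask* tasks,
                                    const complexf* M,       // nCells*(P+1) multipole coeffs
                                    const complexf* center,  // cell centers
                                    complexf* taskL)         // nTasks*(P+1) partial local coeffs
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nTasks) return;

    const complexf* a = &M[tasks[t].sourceCell * (P + 1)];
    complexf* b       = &taskL[t * (P + 1)];
    complexf z0 = center[tasks[t].sourceCell] - center[tasks[t].targetCell];

    // b_0 = a_0 log(-z0) + sum_{k>=1} a_k (-1/z0)^k
    complexf pw(1.0f, 0.0f);
    b[0] = a[0] * thrust::log(-z0);
    for (int k = 1; k <= P; ++k) { pw = -pw / z0; b[0] += a[k] * pw; }

    // b_l = -a_0/(l z0^l) + z0^{-l} sum_{k>=1} a_k (-1/z0)^k C(l+k-1, k-1)
    complexf zl(1.0f, 0.0f);
    for (int l = 1; l <= P; ++l) {
        zl *= z0;
        complexf sum(0.0f, 0.0f), pwk(1.0f, 0.0f);
        float binom = 1.0f;                          // C(l+k-1, k-1), starts at C(l,0)=1
        for (int k = 1; k <= P; ++k) {
            pwk = -pwk / z0;
            if (k > 1) binom *= float(l + k - 1) / float(k - 1);
            sum += a[k] * pwk * binom;
        }
        b[l] = sum / zl - a[0] / (float(l) * zl);
    }
}
```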

GPU kernel version 2. Many threads transform one ME; one thread computes only one term. Less floating-point efficient, but more parallelism, coalesced memory transactions, fewer resources per thread, and other memory tricks. [Diagram: multiple-threads computation pattern, many threads per ME-to-LE transformation.] Result: 482 giga-operations (on one C1060 card), a 100x speedup. 29
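
The finer-grained mapping can be sketched the same way (again illustrative): one block per M2L task and one thread per output term, so each thread does less work, needs fewer registers, and the store of the local-expansion terms is coalesced. The same stand-in 2D log-kernel operator and task-queue layout as in the version-1 sketch are assumed.

```cuda
// Kernel version 2 sketch: one block per M2L task, one thread per output term.
#include <thrust/complex.h>

using complexf = thrust::complex<float>;
constexpr int P = 16;

struct M2LTask { int sourceCell; int targetCell; };

__global__ void m2lOneThreadPerTerm(const M2LTask* tasks,
                                    const complexf* M, const complexf* center,
                                    complexf* taskL)   // launch with blockDim.x >= P+1
{
    int t = blockIdx.x;                               // one block <-> one M2L task
    int l = threadIdx.x;                              // one thread <-> one output term
    if (l > P) return;

    const complexf* a = &M[tasks[t].sourceCell * (P + 1)];  // a[k] reads are broadcast
    complexf z0 = center[tasks[t].sourceCell] - center[tasks[t].targetCell];
    complexf b, pw(1.0f, 0.0f);

    if (l == 0) {                                     // b_0 = a_0 log(-z0) + sum_k a_k (-1/z0)^k
        b = a[0] * thrust::log(-z0);
        for (int k = 1; k <= P; ++k) { pw = -pw / z0; b += a[k] * pw; }
    } else {                                          // b_l for l >= 1 (see version-1 sketch)
        complexf zl(1.0f, 0.0f);
        for (int m = 0; m < l; ++m) zl *= z0;         // z0^l
        float binom = 1.0f;
        b = -a[0] / (float(l) * zl);
        for (int k = 1; k <= P; ++k) {
            pw = -pw / z0;
            if (k > 1) binom *= float(l + k - 1) / float(k - 1);
            b += a[k] * pw * binom / zl;
        }
    }
    taskL[t * (P + 1) + l] = b;                       // one term per thread: coalesced store
}
```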

Lessons Learned 30

Paradigm shift. Start by exposing parallelism: think about homogeneous units of work, about thousands of parallel operations, and about smart usage of resources. Trade operation efficiency for more parallel and resource-efficient kernels. Think about heterogeneous computing: GPUs are not a silver bullet; use the CPU to reorganize work. 31

Conclusions. Heterogeneous computing: use all available hardware! Current FMM peak: 480 giga-ops. Methodology: identify and expose parallelism; distribute work between CPU and GPU; use the best hardware for each job! Current work: a parallel FMM library (many applications) and a multi-GPU implementation of the FMM. 32

Ongoing work. Particle methods map well to new architectures. However, particle methods have the disadvantage of not being as mature as mesh-based methods; much more research has been done on conventional mesh methods. Ongoing work: a compromise between methods, hybrid particle-mesh methods on new architectures. 33

Final remark. Novel architectures versus current applications: how do we cross the bridge between new technologies and current applications? Re-developing algorithms can give large speedups but is far from trivial; porting algorithms can give small speedups with less effort. A cost-effective solution: research and development of heterogeneity-aware libraries. 34

Thanks for listening 35

Velocity calculation: Gaussian particles, N-body problem. Vorticity: $\omega_\sigma(\mathbf{x},t) = \sum_{i=1}^{N}\gamma_i\,\zeta_\sigma(\mathbf{x}-\mathbf{x}_i)$ with $\zeta_\sigma(\mathbf{x}) = \frac{1}{2\pi\sigma^2}\exp\left(-\frac{|\mathbf{x}|^2}{2\sigma^2}\right)$. Velocity: $\mathbf{u}_\sigma(\mathbf{x},t) = \sum_{i=1}^{N}\gamma_i\,\mathbf{K}_\sigma(\mathbf{x}-\mathbf{x}_i)$, with $\mathbf{K}_\sigma(\mathbf{x}) = \frac{1}{2\pi|\mathbf{x}|^2}(-x_2,\,x_1)\left(1-\exp\left(-\frac{|\mathbf{x}|^2}{2\sigma^2}\right)\right)$. 36

Vortex sheet. A discontinuity in the velocity field, represented by vortex elements. The sheet strength $\gamma(s)$ satisfies the boundary integral equation $\gamma(s) + \frac{1}{\pi}\oint\left[\frac{\partial}{\partial n}\log|\mathbf{x}(s)-\mathbf{x}(s')| - \frac{1}{L}\right]\gamma(s')\,ds' = 2\,\mathbf{u}_{\mathrm{slip}}\cdot\hat{\mathbf{s}}$, and the sheet is then diffused into the fluid by solving $\frac{\partial\omega}{\partial t} - \nu\nabla^2\omega = 0$, with $\omega(t-\delta t) = 0$ and $\nu\frac{\partial\omega}{\partial n} = \frac{\gamma(s)}{\delta t}$. 37

Vortex Method algorithm: 1. Discretization, 2. Velocity evaluation, 3. Convection, 4. Diffusion, 5. Spatial adaptation. [Flow chart: Start, steps 1-5 in a loop, End.]

Vortex method algorithm with panel-free boundary conditions. [Flow chart: the loop of steps 1-5 with two added steps, A. Vortex sheet calculation and B. Vortex sheet diffusion.]

Panel-free method. Discretize the body into points (particle discretization); the points are the control points, and the boundary conditions are enforced at the control points. RBF solution: $\gamma(\mathbf{x}) \approx \sum_{i=1}^{N}\phi(|\mathbf{x}-\mathbf{c}_i|)\,\alpha_i$. 42
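
A minimal host-side sketch of this step is given below; the Gaussian basis function and the plain dense solve are assumptions for illustration, not the talk's choices. Given the control points c_i and the sheet strength values gamma(c_i) that enforce the boundary condition there, it solves the collocation system Phi * alpha = gamma with Phi_ij = phi(|c_i - c_j|).

```cuda
// RBF collocation sketch (illustrative choices: Gaussian basis, no pivoting).
#include <vector>
#include <cmath>

static float phi(float r, float shape = 1.0f) {        // Gaussian RBF (assumed)
    return std::exp(-shape * r * r);
}

std::vector<float> solveRBFWeights(const std::vector<float>& cx,
                                   const std::vector<float>& cy,
                                   std::vector<float> gamma)    // rhs, solved in place
{
    int n = (int)gamma.size();
    std::vector<float> A(n * n);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            A[i * n + j] = phi(std::hypot(cx[i] - cx[j], cy[i] - cy[j]));

    // Gaussian elimination (fine for a sketch, not for production use).
    for (int k = 0; k < n; ++k) {
        for (int i = k + 1; i < n; ++i) {
            float f = A[i * n + k] / A[k * n + k];
            for (int j = k; j < n; ++j) A[i * n + j] -= f * A[k * n + j];
            gamma[i] -= f * gamma[k];
        }
    }
    for (int i = n - 1; i >= 0; --i) {                  // back substitution
        for (int j = i + 1; j < n; ++j) gamma[i] -= A[i * n + j] * gamma[j];
        gamma[i] /= A[i * n + i];
    }
    return gamma;                                       // the weights alpha_i
}
```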

Accelerating the M2L. M2L is a two-stage computation: Stage 1, transformation of the MEs; Stage 2, reduction into the LE. [Diagram: several MEs transformed and then reduced into one LE.] 44

PetFMM: Parallel extensible toolkit for the FMM. [Diagram of the parallelization strategy: a root tree down to level k, with sub-trees 1-8 assigned to local domains; M2M and L2L translations and M2L transformations across the tree.] 45

PetFMM: Parallel extensible toolkit for the FMM. [Diagram of the parallel work distribution: a weighted graph with cell weights $w_i$, $w_j$ and communication weights $c_{ij}$.] 46

PetFMM: Parallel extensible toolkit for the FMM. [Plot: speedup of PetFMM for different test cases (uniform 4M L8 R5, uniform 10M L9 R5, spiral 1M L8 R5, spiral with space-filling 1M L8 R5) versus number of processors, 2 to 256, compared with perfect speedup.] 47