Center for Computational Science
Toward GPU-accelerated meshfree fluids simulation using the fast multipole method
Lorena A. Barba, Boston University, Department of Mechanical Engineering
with: Felipe Cruz, PhD student; Simon Layton, PhD student; Rio Yokota, postdoc

Topics
- Meshfree method for fluid simulation: the vortex method
- Computational challenge: the N-body summation involved
- Innovation using new hardware: scientific computing on graphics cards (GPUs)

Meshfree method for fluid simulation

Vortex method for fluid simulation
A particle method for incompressible, Newtonian fluid:
$\frac{\partial \mathbf{u}}{\partial t} + \mathbf{u}\cdot\nabla\mathbf{u} = -\frac{\nabla p}{\rho} + \nu\,\nabla^2\mathbf{u}$
Based on the vorticity, $\omega = \nabla\times\mathbf{u}$:
$\frac{\partial \omega}{\partial t} + \mathbf{u}\cdot\nabla\omega = \omega\cdot\nabla\mathbf{u} + \nu\,\nabla^2\omega$

Vorticity transport equation
$\frac{\partial \omega}{\partial t} + \mathbf{u}\cdot\nabla\omega = \omega\cdot\nabla\mathbf{u} + \nu\,\nabla^2\omega$
In the 2D ideal case, $\frac{D\omega}{Dt} = 0$: if the velocity is known for a fluid element at $\mathbf{x}_i$, vorticity transport is automatically satisfied by
$\frac{d\mathbf{x}_i}{dt} = \mathbf{u}(\mathbf{x}_i, t)$

Vortex method discretization
Express the vorticity as
$\omega \approx \omega_\sigma(\mathbf{x}, t) = \sum_{i=1}^{N} \gamma_i\, \zeta_\sigma(\mathbf{x} - \mathbf{x}_i)$
interpreted as particles, $\frac{d\mathbf{x}_i}{dt} = \mathbf{u}(\mathbf{x}_i, t)$, with a Gaussian distribution
$\zeta_\sigma(\mathbf{x}) = \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{|\mathbf{x}|^2}{2\sigma^2}\right)$
The weights $\gamma_i$ are the circulation strengths.

Find velocity from vorticity: invert $\omega = -\nabla^2\psi$. In 2D,
$\mathbf{u}(\mathbf{x}) = -\frac{1}{2\pi} \int \frac{(\mathbf{x} - \mathbf{x}')\times \omega(\mathbf{x}')\,\hat{\mathbf{e}}_z}{|\mathbf{x} - \mathbf{x}'|^2}\, d\mathbf{x}'$
With $\omega_\sigma(\mathbf{x},t) = \sum_{i=1}^{N} \gamma_i\, \zeta_\sigma(\mathbf{x} - \mathbf{x}_i)$, one gets
$\mathbf{u}_\sigma(\mathbf{x}, t) = \sum_{i=1}^{N} \gamma_i\, \mathbf{K}_\sigma(\mathbf{x} - \mathbf{x}_i)$
with
$\mathbf{K}_\sigma(\mathbf{x}) = \frac{1}{2\pi |\mathbf{x}|^2}\, (-x_2,\, x_1) \left(1 - \exp\!\left(-\frac{|\mathbf{x}|^2}{2\sigma^2}\right)\right)$
Challenge: calculating the velocity is an N-body problem.
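For concreteness, here is a minimal sketch of the direct $O(N^2)$ evaluation of this sum on the CPU (plain C++ that compiles with nvcc; the Particle struct and directVelocity function are illustrative, not part of PetFMM):

```cuda
// Direct O(N^2) evaluation of the regularized Biot-Savart sum (2D).
// Illustrative sketch only: names (Particle, directVelocity) are hypothetical.
#include <cmath>
#include <vector>

struct Particle { float x, y, gamma; };  // position and circulation strength

void directVelocity(const std::vector<Particle>& p, float sigma,
                    std::vector<float>& ux, std::vector<float>& uy)
{
    const float kPi = 3.14159265f;
    const float twoSigma2 = 2.0f * sigma * sigma;
    for (size_t i = 0; i < p.size(); ++i) {          // evaluation points
        float u = 0.0f, v = 0.0f;
        for (size_t j = 0; j < p.size(); ++j) {      // source particles
            float dx = p[i].x - p[j].x, dy = p[i].y - p[j].y;
            float r2 = dx * dx + dy * dy;
            if (r2 == 0.0f) continue;                 // skip self-interaction
            // K_sigma = (-x2, x1) / (2*pi*|x|^2) * (1 - exp(-|x|^2 / (2*sigma^2)))
            float g = (1.0f - std::exp(-r2 / twoSigma2)) / (2.0f * kPi * r2);
            u += p[j].gamma * (-dy) * g;
            v += p[j].gamma * ( dx) * g;
        }
        ux[i] = u; uy[i] = v;
    }
}
```

The two nested loops make the cost of every velocity evaluation scale as $N^2$, which is the bottleneck the fast multipole method removes.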

Advantages
- No mesh!
- Low numerical diffusion compared with traditional CFD methods for $\frac{\partial \omega}{\partial t} + \mathbf{u}\cdot\nabla\omega = \omega\cdot\nabla\mathbf{u} + \nu\,\nabla^2\omega$
- Consider: helicopter rotor-tip vortices (image source: U.S. Navy's Digital Image site)

Fast solution of N-body problem

Fast multipole method
Solves N-body problems, e.g., astrophysical gravity interactions; reduces the operation count from $O(N^2)$ to $O(N)$:
$f(y_j) = \sum_{i=1}^{N} c_i\, K(y_j - x_i), \quad j \in [1 \ldots N]$

Space subdivision: a tree structure to find near and far bodies.

The whole algorithm in a sketch
- Upward sweep: create multipole expansions (P2M, M2M)
- Downward sweep: evaluate local expansions (M2L, L2L, L2P)
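As a rough illustration of that control flow only, a hedged sketch of the two sweeps over an array-based tree follows; every name here (Cell, upwardSweep, downwardSweep, the stub operators) is hypothetical and not the PetFMM API, and the near-field particle-particle interactions are omitted:

```cuda
// Skeleton of the FMM sweeps over an array-based quadtree (illustrative only).
#include <vector>

struct Cell {
    std::vector<int> children;         // indices of child cells (empty for leaves)
    std::vector<int> interactionList;  // well-separated cells at the same level
    std::vector<int> bodies;           // particle indices (leaves only)
    // multipole / local expansion coefficients would live here
};

// Expansion operators, left as stubs in this sketch:
void P2M(Cell&) {}                     // particles -> multipole expansion (leaf)
void M2M(Cell& parent, Cell&) {}       // child multipole -> parent multipole
void M2L(Cell& target, Cell&) {}       // multipole -> local, over the interaction list
void L2L(Cell& parent, Cell&) {}       // parent local -> child local
void L2P(Cell&) {}                     // local expansion -> particle values (leaf)

void upwardSweep(std::vector<Cell>& tree, int c) {
    Cell& cell = tree[c];
    if (cell.children.empty()) { P2M(cell); return; }
    for (int ch : cell.children) {
        upwardSweep(tree, ch);
        M2M(cell, tree[ch]);           // accumulate children into the parent
    }
}

void downwardSweep(std::vector<Cell>& tree, int c) {
    Cell& cell = tree[c];
    for (int src : cell.interactionList) M2L(cell, tree[src]);
    if (cell.children.empty()) { L2P(cell); return; }  // near-field P2P omitted
    for (int ch : cell.children) {
        L2L(cell, tree[ch]);
        downwardSweep(tree, ch);
    }
}
```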

Open-source library: PetFMM
Code: http://barbagroup.bu.edu/barba_group/petfmm.html

Innovation using new hardware

Unwelcome advice (A. Gholum, Intel, June 2008)
- Incremental path: scaling to tap dual- and quad-core performance. A flat route (EASY), but leading to nowhere; the only option for legacy code.
- Going back to the algorithmic drawing board (HARD): rethinking core methods for 10s, 100s, 1000s of cores; the basic logic of the application is influenced; design for parallelism.

Key features of the GPU architecture
Nvidia Tesla C1060 (GT200 chip):
- 30 multiprocessors (MPs) with 8 cores each = 240 processor cores
- cores clocked at 1.296 GHz
- each MP has 16 KB of shared memory
- the device has 4 GB of global memory

Key features of the GPU architecture
Nvidia GT200 chip:
- 1.296 GHz x 10 TPC x 3 SM x 8 SP x 2 flop/cycle = 622 Gflop/s
- 1.296 GHz x 10 TPC x 3 SM x 2 SFU x 4 FPU x 1 flop/cycle = 311 Gflop/s
- Total = 933 Gflop/s
[Figure: GT200 block diagram showing texture processor clusters, streaming multiprocessors (SPs, SFUs, shared memory), and global memory; not reproduced here.]

Key features of the CUDA model
Released Feb. 2007.
- Kernels: the work that will be performed by each parallel thread
- A hierarchy of thread blocks, and a grid of thread blocks
- Shared memory (16 KB): threads in a thread block reside in the same MP and cooperate via their shared memory and synchronization points
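A minimal CUDA sketch of those ingredients (illustrative only, not code from the talk): one thread per target point, a grid of thread blocks, a shared-memory tile staged cooperatively by the block, and __syncthreads() as the synchronization point.

```cuda
// Minimal CUDA sketch: grid of thread blocks, shared memory, synchronization.
// Illustrative only; accumulates sum_j q[j] * x[i] for each target i.
#define TILE 128

__global__ void accumulate(const float* x, const float* q, float* out, int n)
{
    __shared__ float qs[TILE];                      // per-block shared memory tile
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per target point
    float xi = (i < n) ? x[i] : 0.0f;               // guard out-of-range threads
    float acc = 0.0f;

    for (int tile = 0; tile < n; tile += TILE) {
        // cooperatively stage a tile of source data into shared memory
        if (tile + threadIdx.x < n) qs[threadIdx.x] = q[tile + threadIdx.x];
        __syncthreads();                            // wait until the tile is loaded
        int m = min(TILE, n - tile);
        for (int j = 0; j < m; ++j)
            acc += qs[j] * xi;                      // placeholder interaction
        __syncthreads();                            // before overwriting the tile
    }
    if (i < n) out[i] = acc;
}

// Launch with TILE threads per block:
// accumulate<<<(n + TILE - 1) / TILE, TILE>>>(d_x, d_q, d_out, n);
```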

Formulating mathematical algorithms for the new architecture

Challenges of the paradigm shift
- Limiting factor for performance: the ability of programmers to produce code that scales
- Rethink algorithms for a massively parallel architecture (10k threads); the computationally intensive will be the winner
- Implementation difficulties: steep learning curve of getting down to the metal; memory resources handled by hand; data access pattern now crucial

Resource-conscious algorithms: data-parallel programming
- Number of threads: thousands! This affects the entire algorithm.
- Memory: shared memory is only 16 KB! Global memory has a high cost of access: 400-600 cycles.
- Branching: MPs manage threads in groups of 32, called warps. In a warp, threads start and end together; on a branch, the warp executes all paths serially! (See the sketch below.)
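The sketch below illustrates the branching point with a hypothetical kernel: when the 32 threads of a warp disagree on a branch, the hardware runs both paths one after the other with inactive lanes masked off.

```cuda
// Warp-divergence illustration (hypothetical kernel, not from the talk).
__device__ float pathA(float x) { return x * x; }
__device__ float pathB(float x) { return x + 1.0f; }

// Threads 0..31 of a warp share one instruction stream: if they disagree on
// the branch, the warp executes the 'if' path and then the 'else' path,
// masking off the inactive lanes each time, so both bodies cost time.
__global__ void divergent(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0)      // even/odd lanes of the same warp disagree
        out[i] = pathA(in[i]);     // executed with odd lanes masked off
    else
        out[i] = pathB(in[i]);     // executed with even lanes masked off
}
```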

Formulation of the FMM for the GPU
Recent work:

Going back to the algorithmic drawing board (Felipe Cruz, PhD student)
M2L = a matrix-vector multiplication:
- dense matrix of size equal to $2p^2$
- a large number of mat-vecs: $27 \times 4^L$ (in 2D); e.g., L = 5 gives 27,648 mat-vecs
Version 0: each mat-vec in a separate thread
- e.g., p = 12, matrix size $2p^2 = 288$, i.e., 2304 bytes, computed on the fly
- a maximum of 6 fit in shared memory: too few threads!

Going back to the algorithmic drawing board
Exploit the structure of the matrix. Optimization: a matrix-free mat-vec:
- traverse the matrix by diagonals
- reuse terms
(figure: ME to LE)

Going back to the algorithmic drawing board
Version 1: each thread block transforms one ME for all the interaction list.
- max 8 concurrent thread blocks per MP; 27 x 8 = 216 concurrent threads
- the current maximum of threads is 512: still under-utilized
Result: 20 Gflops peak, 2.5x10^6 t/s. (We are not reporting speed-up anymore, because it does not mean very much. A CPU run produced 1.42x10^5 t/s.)

Going back to the algorithmic drawing board
We're still not happy: we want more parallelism, and there are too many memory transactions.
- each result is moved to global memory: one LE of size 2p, for 27 results, i.e., 27 x 2p = 648 floats for p = 12
- only 32 workers to move the results: 20 memory transactions

Going back to the algorithmic drawing board
Version 2: one thread per element of the LE.
- each thread does one row-vector multiply
- no thread synchronization required
- BUT we cannot use the efficient diagonal traversal (about 25% more work)
(figure: ME to LE; a sketch of the thread assignment follows below)
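A hypothetical CUDA sketch of that assignment, with a generic dense matrix standing in for the M2L operator: one thread per output coefficient, each doing an independent row-vector product.

```cuda
// One thread per output element: thread r computes row r of A*x.
// Hypothetical sketch; A is a generic dense rows-by-cols matrix in row-major
// order, standing in for an M2L translation applied to one multipole expansion.
__global__ void rowPerThreadMatVec(const float* A, const float* x, float* y,
                                   int rows, int cols)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    float acc = 0.0f;
    for (int c = 0; c < cols; ++c)
        acc += A[r * cols + c] * x[c];   // independent of all other threads
    y[r] = acc;                          // no __syncthreads() needed
}
```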

Going back to the algorithmic drawing board
Version 2: avoiding branching.
- each thread pre-computes $t^{n+1}$, and additional $t$ factors are added as needed
- in the loop over n, each thread would naturally stop at a different value; instead (counter-intuitively!) all threads loop until n = p + 1, but each stores only its relevant power
- no optimizing/tuning yet...
Result: 94 Gflops peak, 6x10^6 t/s. (A sketch of the uniform loop follows below.)
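A hedged sketch of the branch-free pattern (hypothetical kernel, with the expansion variable reduced to a single real factor t for brevity): every thread runs the same loop bound p + 1, so the warp never diverges, and each thread keeps only the power it needs.

```cuda
// Branch-free power computation (hypothetical sketch).
// Thread k needs t^k, but all threads loop to the same bound n = p + 1, so the
// warp never diverges; each thread keeps only the power it is responsible for.
__global__ void uniformPowers(float t, float* myPower, int p)
{
    int k = threadIdx.x;            // thread k is responsible for t^k
    float power = 1.0f;
    float kept  = 1.0f;
    for (int n = 0; n < p + 1; ++n) {
        if (n == k) kept = power;   // cheap conditional assignment, not a divergent loop exit
        power *= t;
    }
    if (k <= p) myPower[k] = kept;
}
```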

Key features of the new algorithm
1. increased number of threads per block (we can have any number)
2. all threads always perform the same computations (on different data)
3. loop unrolling (if done manually, the performance gain is even greater!)
4. reduced accesses to global memory
5. when accessing memory, ensure it is coalesced (see the sketch below)
Final step: overlap memory accesses.
Result: 480 Gflops peak, 19x10^6 t/s
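For items 3 and 5 of this list, a small hypothetical illustration: consecutive threads read consecutive addresses, so each warp's loads coalesce, and the fixed-length inner loop is unrolled with #pragma unroll (or by hand, as the slide notes, for even more gain).

```cuda
// Coalesced access and loop unrolling (hypothetical illustration).
#define TERMS 8

__global__ void coalescedUnrolled(const float* in, float* out, int n)
{
    // Consecutive threads read consecutive addresses: the warp's 32 loads
    // coalesce into a minimal number of memory transactions.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i];

    float acc = 0.0f, term = 1.0f;
    #pragma unroll            // fixed trip count: the compiler unrolls fully
    for (int k = 0; k < TERMS; ++k) {
        acc += term;          // accumulate 1 + x + x^2 + ... + x^(TERMS-1)
        term *= x;
    }
    out[i] = acc;
}
```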

Context of this project
Fast summation algorithms:
- we have a distributed parallel library of the fast multipole method, PetFMM (*)
- we are developing GPU implementations, currently running at about 500 gigaflops on one card; speed-ups are reported w.r.t. the fastest CPU available
Applications: meshfree fluid simulation, using N-body solvers.
(*) In collaboration with Matthew Knepley (UChicago)

Frontiers of CFD
- Need to straddle many scales
- Algorithms to detect and adapt to the solution: meshfree methods can naturally adapt
- Hardware-aware software: meshfree methods are well-suited for the GPU!
- Tackle problems with complex/moving geometry
- Algorithms/software that allow real-time simulation