S7260: Microswimmers on Speed: Simulating Spheroidal Squirmers on GPUs


Elmar Westphal, Forschungszentrum Jülich GmbH

Spheroids. A spheroid is the volume formed by rotating an ellipse around one of its axes. There are two kinds: oblate (rotated around its shorter axis), like a pumpkin or a teapot, and prolate (rotated around its longer axis), like an American football.
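In formulas, taking the rotation axis as z, with semi-axis a along it and equatorial semi-axis b, a spheroid is the set of points satisfying

\[ \frac{x^2 + y^2}{b^2} + \frac{z^2}{a^2} \le 1, \]

oblate for a < b and prolate for a > b.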

Squirmers. The squirmer is a model for simulating microswimmers. Its origins date back to the 1950s, and it is one of today's standard models for self-propelled swimmers. Real microswimmers use different means to swim (flagella, arm-like structures etc.); here we simulate the flow caused by surfaces covered with short cilia (filaments). Example: Paramecium bursaria. [Image: a green Paramecium, Frank Fox / www.mikro-foto.de, licence CC BY-SA 3.0 de]

The Simulation. Our simulation is split into three parts: movement of the squirmers and the interactions between them, simulation of the liquid, and the interactions between the squirmers and the liquid.

Simulation of the Squirmers. The number of squirmers is low to moderate (up to a few thousand), so simulating the squirmers and their interactions is sufficiently fast on the CPU. This may change for future projects.

Simulation of the Fluid. We use discrete fluid particles and an algorithm called Multi-Particle Collision Dynamics (MPC) to simulate the fluid. MPC inherently conserves the energy and momentum of the fluid. The phenomena we want to study also require conservation of the angular momentum of the fluid particles, which roughly doubles the computational effort, and walls in one dimension of the system to form a slit, which require additional ghost particles. A sufficiently large simulation box for a moderate number of squirmers contains millions of fluid particles. GPU to the rescue!
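For reference, MPC conserves momentum and kinetic energy because its collision step only rotates particle velocities relative to the centre-of-mass velocity of their cell. In the stochastic-rotation variant (which matches the "rotated around a random axis" step described below), the rule is

\[ \mathbf{v}_i' = \mathbf{v}_{\mathrm{cm}} + \mathbf{R}(\alpha)\,(\mathbf{v}_i - \mathbf{v}_{\mathrm{cm}}), \]

where R(α) rotates by a fixed angle α around an axis drawn at random for each cell and time-step. The angular-momentum-conserving variant used in this work adds a cell-wise correction term to this rule, which is not reproduced here.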

Please note that our simulation is in fact a three-dimensional system; 2D drawings are used in the following to explain the algorithms for the sake of simplicity.

Multi-particle Collision Dynamics (MPC)
The fluid particles are processed in the following steps (the GPU characteristics of each step are given in parentheses):
1. Fluid particles are moved ballistically (one thread per particle, memory-bound).
2. Particles are sorted into the cells of a randomly shifted grid (one thread per particle, memory-bound).
3. For each cell, the centre of mass, centre-of-mass velocity, kinetic energy and angular momentum are computed (one thread per particle, atomic-bound, then one thread per cell, memory-bound).
4. Relative velocities are calculated (one thread per particle, memory-bound by random cell-data reads, served by the texture cache).
5. The relative velocities are rotated around a random axis (same characteristics as step 4).
6. Angular momentum and kinetic energy are restored (several steps using one thread per particle or cell, mostly atomic-bound).
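A minimal CUDA sketch of the flavour of the first two kernel types in this list: the per-particle streaming step and an atomic per-cell accumulation. All names (Particle, CellSum, cell_of), the unit cell size and the equal particle masses are assumptions for illustration, not the talk's actual code, which is considerably more optimised.

```cuda
#include <cuda_runtime.h>

struct Particle { float3 r, v; };
struct CellSum  { float3 momentum; float mass; };   // per-cell sums, filled atomically

// Assumed mapping of a (randomly shifted) position to a cell index on an L x L x L
// grid of unit cells; positions are taken to lie in [0, L) after folding.
__device__ int cell_of(float3 r, float3 shift, int L)
{
    int x = min(L - 1, max(0, (int)floorf(r.x + shift.x)));
    int y = min(L - 1, max(0, (int)floorf(r.y + shift.y)));
    int z = min(L - 1, max(0, (int)floorf(r.z + shift.z)));
    return (z * L + y) * L + x;
}

// Streaming step: one thread per particle, purely memory-bound.
__global__ void stream(Particle* p, int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    p[i].r.x += p[i].v.x * dt;
    p[i].r.y += p[i].v.y * dt;
    p[i].r.z += p[i].v.z * dt;
}

// Per-cell accumulation of mass and momentum: one thread per particle,
// limited by atomic throughput (unit particle mass assumed).
__global__ void accumulate(const Particle* p, int n, float3 shift, int L, CellSum* cells)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int c = cell_of(p[i].r, shift, L);
    atomicAdd(&cells[c].momentum.x, p[i].v.x);
    atomicAdd(&cells[c].momentum.y, p[i].v.y);
    atomicAdd(&cells[c].momentum.z, p[i].v.z);
    atomicAdd(&cells[c].mass, 1.0f);
}
```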

MPC on GPU. The implementation is limited by the speed of atomic operations and by memory bandwidth. It uses optimisations described in some of my earlier GTC talks: reordering of particles to preserve data locality (S2036*) and reducing the number of atomic operations (S5151). *The speed of atomic operations has improved significantly over time, so parts of the implementation described in that talk have since been abandoned.
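S5151 covers voting and shuffling techniques for cutting down the number of atomics. The sketch below shows the same idea expressed with today's intrinsics; it is a simplified stand-in, not the talk's implementation. Lanes of a warp that target the same cell combine their contributions first, so only one atomicAdd per distinct cell is issued. __match_any_sync requires compute capability 7.0 or newer.

```cuda
#include <cuda_runtime.h>

// Add v to cell_sums[cell], letting lanes of the warp that share the same cell
// cooperate so that only one atomicAdd per distinct cell is issued.
// Assumes a 1-D thread block for the lane-id computation.
__device__ void aggregated_add(float* cell_sums, int cell, float v)
{
    unsigned active = __activemask();
    unsigned peers  = __match_any_sync(active, cell);  // lanes with the same cell value
    int      lane   = threadIdx.x & 31;
    int      leader = __ffs(peers) - 1;                // lowest lane of the group

    // The leader collects the peers' values one by one via shuffles.
    // All peer lanes execute the same shuffle sequence (identical 'peers' mask),
    // so the sync mask stays valid under independent thread scheduling.
    float sum = 0.0f;
    for (unsigned m = peers; m != 0; m &= m - 1) {
        int   src = __ffs(m) - 1;
        float w   = __shfl_sync(peers, v, src);
        if (lane == leader) sum += w;
    }
    if (lane == leader) atomicAdd(&cell_sums[cell], sum);
}
```

This only pays off when neighbouring threads tend to hit the same cell, which is exactly what the reordering from S2036 provides.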

Interactions between Fluid and Squirmers. Squirmer surfaces are considered impenetrable for the fluid and are therefore boundaries for the fluid particles. Collisions have to be detected, and their impact has to be combined and applied to the squirmers and fluid particles accordingly. This happens on the GPU (large number of fluid particles); the total impact for each squirmer is then passed to and processed by the CPU (low number of squirmers).

Collisions Between Fluid Particles and Squirmers. Fluid particles and squirmers move during each time-step. Squirmer walls are considered impenetrable, so fluid particles that enter a squirmer have to be dealt with accordingly.

Collisions Between Fluid Particles and Squirmers. Detecting penetration requires rotating each particle into the squirmer's frame of reference and scaling it according to the squirmer's axis ratio. Checking every fluid particle against every squirmer is too much work, even for a GPU.
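A possible shape of that penetration test, as a sketch under an assumed data layout (e[2] is the squirmer's symmetry axis, a the semi-axis along it, b the equatorial semi-axis): rotate the particle into the body frame, rescale the symmetry axis by the axis ratio, then do a plain sphere test.

```cuda
#include <cuda_runtime.h>

struct SquirmerFrame {
    float3 centre;
    float3 e[3];   // orthonormal body axes, e[2] = symmetry (rotation) axis
    float  a, b;   // semi-axis along e[2] and equatorial semi-axis
};

__host__ __device__ inline float dot3(float3 u, float3 v)
{
    return u.x * v.x + u.y * v.y + u.z * v.z;
}

// True if position r lies inside the spheroid: after rotating into the body frame
// and scaling the symmetry axis by b/a, the spheroid becomes a sphere of radius b.
__host__ __device__ inline bool inside_spheroid(const SquirmerFrame& s, float3 r)
{
    float3 d = make_float3(r.x - s.centre.x, r.y - s.centre.y, r.z - s.centre.z);
    float x = dot3(d, s.e[0]);
    float y = dot3(d, s.e[1]);
    float z = dot3(d, s.e[2]) * (s.b / s.a);
    return x * x + y * y + z * z < s.b * s.b;
}
```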

Handling Squirmer-Fluid Collisions. The system is divided into cells. Regardless of its orientation, a spheroid cannot extend beyond the sphere around its centre whose radius is the largest semi-axis. Each squirmer has a list of the cells affected by its surrounding sphere (these lists may overlap), and each cell has a list of the fluid particles it contains. This reduces the number of checks from N_sq × N_fluid to roughly N_sq × V_sq × ρ, where V_sq is the volume covered by a squirmer's surrounding sphere and ρ is the fluid particle density.
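One common way to realise "each cell has a list of the fluid particles it contains" on the GPU, in the spirit of the classic CUDA particles sample and not necessarily the talk's exact layout, is a CSR-style arrangement built from a sort and a vectorised binary search:

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>
#include <thrust/binary_search.h>
#include <thrust/iterator/counting_iterator.h>

// Build, for every cell, the index range of the fluid particles it contains.
// Afterwards the particles of cell c are
//   particle_idx[cell_start[c]] .. particle_idx[cell_start[c + 1] - 1].
void build_cell_lists(thrust::device_vector<int>& cell_of_particle,  // cell key per particle (gets sorted)
                      thrust::device_vector<int>& particle_idx,      // out: particle ids grouped by cell
                      thrust::device_vector<int>& cell_start,        // out: num_cells + 1 offsets
                      int num_cells)
{
    const int n = static_cast<int>(cell_of_particle.size());

    particle_idx.resize(n);
    thrust::sequence(particle_idx.begin(), particle_idx.end());
    thrust::sort_by_key(cell_of_particle.begin(), cell_of_particle.end(), particle_idx.begin());

    cell_start.resize(num_cells + 1);
    thrust::counting_iterator<int> first_cell(0);
    thrust::lower_bound(cell_of_particle.begin(), cell_of_particle.end(),
                        first_cell, first_cell + num_cells + 1, cell_start.begin());
}
```

A squirmer then only loops over the cells of its bounding-sphere box and, for each of them, over the particles in that cell's index range, which gives the N_sq × V_sq × ρ scaling quoted above.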

Handling Squirmer Surroundings on GPU. For each squirmer, a cubical box of cells is defined that contains its surrounding sphere. These cells can be conveniently coded into grid and block dimensions: the x- and y-directions go into threadIdx.x and threadIdx.y, the z-direction into blockIdx.x (because of the limit of 1024 threads per block), and the squirmer index into blockIdx.y. Affected cells are selected using approximate cell stencils, and cells are processed one per thread.
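The described mapping might look like this in code; the kernel body is only commented pseudocode, since the actual stencil test and particle handling depend on the talk's data structures.

```cuda
#include <cuda_runtime.h>

// One thread per cell of the cubical box around a squirmer's bounding sphere,
// one grid row (blockIdx.y) per squirmer.
__global__ void process_squirmer_cells()
{
    int cx = threadIdx.x;   // cell x inside the box
    int cy = threadIdx.y;   // cell y inside the box
    int cz = blockIdx.x;    // cell z goes into the grid: a block may not exceed 1024 threads
    int sq = blockIdx.y;    // index of the squirmer this block works on

    // 1. translate (cx, cy, cz) into a global cell index using squirmer sq's cell list
    // 2. discard cells outside the approximate stencil of the bounding sphere
    // 3. test the cell's fluid particles for penetration, move them, accumulate the impact
    (void)cx; (void)cy; (void)cz; (void)sq;  // silence unused-variable warnings in this sketch
}

void launch_squirmer_pass(int box_edge, int num_squirmers)
{
    dim3 block(box_edge, box_edge, 1);       // x- and y-direction of the cell box
    dim3 grid(box_edge, num_squirmers, 1);   // z-direction and squirmer index
    process_squirmer_cells<<<grid, block>>>();
}
```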

Handling Squirmer Surroundings on GPU. Threads from the described grid are used to select the affected cells, check the particles in the selected cells for collisions, move colliding particles accordingly, sum up the collision impact to be applied to the squirmers, and generate random fluid particles inside the squirmers to preserve the physical properties of the system.

Performance Pitfalls. Squirmers, and MPC with walls, require volumes filled with random particles. Their relative impact depends on the ratio of the simulation box's surface to its height (the wall algorithm adds an additional cell layer in one direction) and on the number and size of the squirmers. The lack of data locality among the random particles inside the squirmers has a severe impact on the performance of the MPC implementation on the GPU: fewer memory accesses can be combined, and the atomics optimisations depend on data locality.

Computation Time. Mostly due to the influence of the spatially unordered random particles inside the squirmers, computation times vary significantly with the number of squirmers (measured on a Tesla K80): 4.9 ns per iteration and MPC particle for large systems with 1 squirmer, and 13.2 ns per iteration and MPC particle for large systems with 3692 squirmers. The squirmer interactions done on the CPU take only ~4% of the computation time.

General Implementation. Heavily templated C++11 code. The MPC part works in 2D and 3D, in different precisions and with a variety of options, generating optimised code for different applications and GPU architectures. It uses a class template for particle sets to allow mixed precisions and features in a single simulation; user-managed particle sets can be injected into most steps of the process.

General Implementation. Based on a std::vector-like class template using managed memory; merely replacing the allocator is not enough, because the members and operators of std::vector are not defined for device code. This allows easy exchange of data between the CPU and GPU parts. Texture caching is used where applicable. Operator templates for the CUDA vector types (float4 etc.) make life a lot easier.
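A stripped-down sketch of both conveniences, with assumed names and none of the real container's features (growth, copies, iterators): a managed-memory vector whose accessors are usable on host and device, unlike a std::vector with a swapped allocator, plus an element-wise operator for float4.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <new>

template <typename T>
class managed_vector {
    T*          data_ = nullptr;
    std::size_t size_ = 0;
public:
    explicit managed_vector(std::size_t n) : size_(n) {
        if (cudaMallocManaged(reinterpret_cast<void**>(&data_), n * sizeof(T)) != cudaSuccess)
            throw std::bad_alloc();
    }
    ~managed_vector() { cudaFree(data_); }
    managed_vector(const managed_vector&)            = delete;  // keep the sketch minimal
    managed_vector& operator=(const managed_vector&) = delete;

    // Unlike std::vector's members, these are callable from device code as well,
    // which is why merely replacing the allocator would not be enough.
    __host__ __device__ T&       operator[](std::size_t i)       { return data_[i]; }
    __host__ __device__ const T& operator[](std::size_t i) const { return data_[i]; }
    __host__ __device__ std::size_t size() const { return size_; }
    __host__ __device__ T*          data()       { return data_; }
};

// An element-wise operator for the CUDA vector types:
__host__ __device__ inline float4 operator+(float4 a, float4 b)
{
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}
```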

Memory Considerations. MPC particles use 52 bytes of memory each, MPC cells use 160 bytes each (with angular momentum conservation), and the data of a squirmer takes about 600 bytes. The random particles inside the squirmers do not add to the total, because they replace particles of the original fluid. On recent GPUs, this allows system sizes of up to ~10M cells and ~15K squirmers.

Outlook. The next version (currently in testing) will definitely feature MPI parallelisation, allowing larger simulations and/or higher speed; hybrid code using template magic to generate CPU- or GPU-based executables from the same sources (real kernels, not pragma-based); and fewer restraints through the use of even more templates. It is also prepared for bringing some (spatial) order to the random-particle chaos and for bit-true, reproducible calculations.

Example: 800 squirmers in a slit, Lx = Lz = 300, Ly = 7, ~7M fluid particles, 20M timesteps, 2-3 weeks walltime. Simulation and rendering courtesy of Mario Theers. Further reading: Soft Matter, 2016, 12, 7372-7385, DOI: 10.1039/C6SM01424K.


Thank you for your time. Questions? ~100 squirmers in flow, Lx = 300, Ly = Lz = 60, ~11M fluid particles, 1M timesteps, 14 h walltime. Simulation courtesy of Hemalatha Annepu, rendering courtesy of Mario Theers.
