From Notebooks to Supercomputers: Tap the Full Potential of Your CUDA Resources with LibGeoDecomp

From Notebooks to Supercomputers: Tap the Full Potential of Your CUDA Resources with LibGeoDecomp. Andreas Schäfer (andreas.schaefer@cs.fau.de), Friedrich-Alexander-Universität Erlangen-Nürnberg. GPU Technology Conference 2013, San José, CA, March 21, 2013.

Outline 1 Applications 2 Architecture 3 Usage

1. Applications What can it do?

Parallel IMage Particle Flow (PIMPF): particle-in-cell code; 300k particles at 30 FPS; runs on PC and Android; the PC is controlled via a tablet.

Computational Fluid Dynamics: Lattice Boltzmann Method toolkit; 2D/3D solid/fluid interaction; various boundary conditions: free-slip, no-slip, moving wall, Zou/He (pressure, velocity).

Simulation of Granular Gases: n-body code; periodic boundaries; uniform distribution. Model: Institute for Multiscale Simulation, FAU.

Simulation of Dendritic Growth: 1000² cells take 8 hours; 1000³ cells would take 3 years? Cellular automaton + finite difference code. Model: Institute for Metallic Materials, FSU.

Simulation of Dendritic Growth (cont.): Parallelization options: manual parallelization; reimplement within a problem-solving environment; reimplement in a parallel programming language; port to LibGeoDecomp.

Port to LibGeoDecomp: success! Total code base: 5000 LOC (lines of code); only 100 LOC changed for the port. Video: 2 TB of output, 7 days of compute, 3 days of visualization. In active use since 2011.

2. Architecture How does it work?

Target Applications: LibGeoDecomp (Library for Geometric Decomposition codes) targets time- and space-discrete simulations, chiefly stencil codes on regular 2D/3D grids. Challenges: the computations are tightly coupled (communication, load-balancing); vectorization, ILP, caches, multi-cores, GPUs, MPI...

Architecture: Cell: user-supplied model. Simulator: generic parallelization (C++ templates); handles the spatial/temporal loops and calls back into the user code. Plug-ins for OpenMP, CUDA, MPI, HPX. Features: hierarchical parallelization, parallel IO, live-steering, in situ visualization, dynamic load-balancing...
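To make the callback pattern concrete, here is a deliberately simplified, self-contained sketch in plain C++ (1D, illustrative names only; this is not LibGeoDecomp's real API, which appears in the Usage section below). The generic run() routine plays the Simulator's role: it owns the grids, drives the time and space loops, and calls back into the user-supplied cell type.

// Minimal sketch of the Cell/Simulator callback pattern (illustrative only).
#include <cstddef>
#include <vector>

// User-supplied model: knows how to update one cell from its neighborhood.
struct HeatCell
{
    double temp;

    template<typename NEIGHBORHOOD>
    void update(const NEIGHBORHOOD& hood, std::size_t /* step */)
    {
        // toy rule: average of the two direct neighbors
        temp = 0.5 * (hood[-1].temp + hood[+1].temp);
    }
};

// Read-only view of the old grid around one cell, with periodic wrap-around.
template<typename CELL>
class Neighborhood
{
public:
    Neighborhood(const std::vector<CELL>& grid, long center) :
        grid(grid),
        center(center)
    {}

    const CELL& operator[](long offset) const
    {
        long size = static_cast<long>(grid.size());
        return grid[(center + offset + size) % size];
    }

private:
    const std::vector<CELL>& grid;
    long center;
};

// Generic "simulator": owns the grids, runs the time and space loops,
// and calls back into the user-supplied cell type.
template<typename CELL>
void run(std::vector<CELL>& grid, std::size_t steps)
{
    std::vector<CELL> buffer(grid);
    for (std::size_t t = 0; t < steps; ++t) {
        for (std::size_t i = 0; i < grid.size(); ++i) {
            Neighborhood<CELL> hood(grid, static_cast<long>(i));
            buffer[i].update(hood, t);
        }
        grid.swap(buffer);
    }
}

int main()
{
    std::vector<HeatCell> grid(16, HeatCell{0.0});
    grid[8].temp = 1.0;   // single hot spot
    run(grid, 10);        // diffuse for 10 time steps
    return 0;
}

The division of labor mirrors the slide: the user writes only the per-cell update rule, while all looping, buffering and (in the real library) parallelization machinery stays behind the generic simulator.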

Performance: Reverse Time Migration (RTM) on Tsubame 2.0, scaled to 1080 GPUs (>480k cores); weak-scaling efficiency >90%. [Plot: RTM performance in TFLOPS (0-45) versus number of GPUs (0-1200).]
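As a reading aid (standard definition, not from the slide): under weak scaling the problem size grows with the GPU count k, and the quoted efficiency compares the measured aggregate throughput P(k) against perfect linear scaling of the single-GPU throughput P(1):

E_{\mathrm{weak}}(k) = \frac{P(k)}{k \cdot P(1)}, \qquad E_{\mathrm{weak}}(1080) > 0.9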

3. Usage How can I use it?

3D Jacobi Smoother: Naïve Code. Heat dissipation with periodic boundary conditions; a well-known example that looks simple but offers plenty of potential for optimization and plenty of pitfalls.
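For reference, the update rule that all of the following kernels implement is the 6-point Jacobi average over the 3D von Neumann neighborhood, with periodic (torus) boundaries, i.e. all indices taken modulo the grid dimension:

T^{t+1}_{x,y,z} = \tfrac{1}{6}\bigl(T^{t}_{x-1,y,z} + T^{t}_{x+1,y,z} + T^{t}_{x,y-1,z} + T^{t}_{x,y+1,z} + T^{t}_{x,y,z-1} + T^{t}_{x,y,z+1}\bigr)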

3D Jacobi Smoother: Naïve Code (cont.)

#define GET(X, Y, Z) (gridold)[                            \
    ((z + X + griddim) % griddim) * griddim * griddim +    \
    ((y + Y + griddim) % griddim) * griddim +              \
    ((x + Z + griddim) % griddim)]
#define SET() (gridnew)[                                    \
    ((z + griddim) % griddim) * griddim * griddim +        \
    ((y + griddim) % griddim) * griddim +                  \
    ((x + griddim) % griddim)]

for (int z = 0; z < griddim; ++z) {
    for (int y = 0; y < griddim; ++y) {
        for (int x = 0; x < griddim; ++x) {
            SET() = (GET( 0,  0, -1) + GET( 0, -1,  0) + GET(-1,  0,  0) +
                     GET( 1,  0,  0) + GET( 0,  1,  0) + GET( 0,  0,  1)) * (1.0 / 6.0);
        }
    }
}

0.10 GLUPS (Giga Lattice Updates Per Second), 1 thread, Intel Core i7-3610QM @ 2.30 GHz (Ivy Bridge); matrix size: 512³ cells

3D Jacobi Smoother: Naïve Code + OpenMP

#pragma omp parallel for
for (int z = 0; z < griddim; ++z) {
    for (int y = 0; y < griddim; ++y) {
        for (int x = 0; x < griddim; ++x) {
            SET() = (GET( 0,  0, -1) + GET( 0, -1,  0) + GET(-1,  0,  0) +
                     GET( 1,  0,  0) + GET( 0,  1,  0) + GET( 0,  0,  1)) * (1.0 / 6.0);
        }
    }
}

0.37 GLUPS, 4 threads, Intel Core i7-3610QM @ 2.30 GHz (Ivy Bridge); speedup: 3.6

3D Jacobi Smoother: Model

class Cell
{
public:
    typedef Stencils::VonNeumann<3, 1> Stencil;
    typedef Topologies::Torus<3>::Topology Topology;

    class API : public CellAPITraits::Fixed
    {};

    static inline unsigned nanosteps()
    {
        return 1;
    }

    inline explicit Cell(const double& v = 0) :
        temp(v)
    {}

#define GET(X, Y, Z) hood[FixedCoord<X, Y, Z>()].temp
#define SET() temp

    template<typename COORD_MAP>
    void update(const COORD_MAP& hood, const unsigned& nanostep)
    {
        SET() = (GET( 0,  0, -1) + GET( 0, -1,  0) + GET(-1,  0,  0) +
                 GET( 1,  0,  0) + GET( 0,  1,  0) + GET( 0,  0,  1)) * (1.0 / 6.0);
    }

    double temp;
};

3D Jacobi Smoother: Simple

#include <libgeodecomp.h>

class Cell
{ ... };

class CellInitializer
{ ... };

int main(int argc, char *argv[])
{
    CellInitializer *init = new CellInitializer(griddim, steps);
    SerialSimulator<Cell> sim(init);
    sim.run();
    return 0;
}

0.40 GLUPS, 1 thread
0.50 GLUPS, 4 threads, Intel Core i7-3610QM @ 2.30 GHz (Ivy Bridge)

3D Jacobi Smoother: Cache Blocking

#include <libgeodecomp.h>

class Cell
{ ... };

class CellInitializer
{ ... };

int main(int argc, char *argv[])
{
    CellInitializer *init = new CellInitializer(griddim, steps);
    CacheBlockingSimulator<Cell> sim(init, 3, Coord<2>(270, 30));
    sim.run();
    return 0;
}

1.04 GLUPS, 4 threads, Intel Core i7-3610QM @ 2.30 GHz (Ivy Bridge)

3D Jacobi Smoother: GPU

#include <libgeodecomp.h>

class Cell
{ ... };

class CellInitializer
{ ... };

int main(int argc, char *argv[])
{
    CellInitializer *init = new CellInitializer(griddim, steps);
    CudaSimulator<Cell> sim(init);
    sim.run();
    return 0;
}

2.48 GLUPS, Tesla C2050
4.87 GLUPS, Tesla K20 (47x faster than the naïve code)

Summary: Self-Adapting Stencil Codes for the Grid, http://www.libgeodecomp.org. Single-node performance matters (47x speedup...); scalability != peak performance; LibGeoDecomp delivers both. Thanks!