RAMSES on the GPU: An OpenACC-Based Approach

RAMSES on the GPU: An OpenACC-Based Approach
Claudio Gheller (ETHZ-CSCS), Giacomo Rosilho de Souza (EPFL Lausanne), Romain Teyssier (University of Zurich), Markus Wetzstein (ETHZ-CSCS)
PRACE-2IP project, EU 7th Framework Programme RI-283493

Cosmological Simulations
Numerical simulations are an extraordinary tool to study and solve astrophysical problems. They require:
- Sophisticated simulation codes, including all the necessary physics and adopting suitable, effective algorithms
- Data processing, analysis and visualization tools to handle the enormous amount of generated information
- High-end HPC systems that provide the necessary computing power

What & Why GPUs
GPUs are hardware components born for graphics; they are now widely used for computing. On suitable algorithms, GPUs are much faster than CPUs, so they can dramatically reduce the time to solution:
- Data-parallel algorithms (each piece of data is processed independently of the others) are privileged;
- High flops/bytes ratios are favored;
- Memory-intensive (in size or access pattern) algorithms can be hard to implement and/or optimize;
- Asynchronous operations are supported and must be exploited;
- Code development is not so hard, but getting a fast code can require a huge effort.

The RAMSES code: overview
- RAMSES (R. Teyssier, A&A, 385, 2002): a code to study astrophysical problems.
- It treats various components at the same time (dark energy, dark matter, baryonic matter, photons).
- It includes a variety of physical processes (gravity, magnetohydrodynamics, chemical reactions, star formation, supernova and AGN feedback, etc.).
- Open source, Fortran 90, code size about 70000 lines.
- MPI parallel (public version); OpenMP support (restricted access).
- http://irfu.cea.fr/phocea/vie_des_labos/ast/ast_sstechnique.php?id_ast=904

RAMSES workflow
A 3D Eulerian Adaptive Mesh Refinement code. The code solves:
- Dark matter: N-body particle-mesh technique
- Gravity: multigrid technique
- Hydrodynamics: various shock-capturing methods
- A number of additional physics processes
Spatial discretization is done through an adaptive Cartesian mesh; AMR provides high resolution ONLY where it is strictly necessary.
[Diagram: time loop consisting of AMR build, communication/balancing, gravity, hydro, N-body, more physics.]

RAMSES: solving fluid dynamics
Fluid dynamics is one of the key kernels, and also among the most computationally demanding. It is solved on a computational mesh through three conservation equations (mass, momentum and energy):

  \partial_t \rho + \nabla \cdot (\rho \mathbf{u}) = 0
  \partial_t (\rho \mathbf{u}) + \nabla \cdot (\rho \mathbf{u} \otimes \mathbf{u}) + \nabla p = -\rho \nabla \phi
  \partial_t (\rho e) + \nabla \cdot [\rho \mathbf{u} (e + p/\rho)] = -\rho \mathbf{u} \cdot \nabla \phi

[Diagram: fluxes computed across the faces of cell (i,j); time-loop diagram as in the workflow slide.]
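As a schematic illustration of what such a solver does per cell, here is a minimal 1-D finite-volume sketch with made-up names and sizes (not RAMSES's own routines such as unsplit_ or cmpflxm_): each cell's conserved state is advanced by the difference of the fluxes at its two interfaces.

program fv_update_sketch
  implicit none
  ! Minimal 1-D finite-volume update (illustrative only, not RAMSES source):
  ! the conserved state of each cell is advanced by the difference of the
  ! fluxes computed at its two interfaces.
  integer, parameter :: n = 64, nvar = 3     ! cells and conserved variables
  real(8) :: u(nvar, n)                      ! density, momentum, energy per cell
  real(8) :: flux(nvar, n+1)                 ! fluxes at the n+1 cell interfaces
  real(8) :: dt, dx
  integer :: i

  dx = 1.0d0 / n
  dt = 1.0d-3
  u    = 1.0d0     ! dummy initial state
  flux = 0.0d0     ! in RAMSES the interface fluxes come from a Riemann solver

  do i = 1, n
     u(:, i) = u(:, i) - dt/dx * (flux(:, i+1) - flux(:, i))
  end do

  print *, 'cell 1 after the update:', u(:, 1)
end program fv_update_sketch

In the real code the fluxes are obtained from a Riemann solver and the update is carried out in 3D over the AMR patches, but the data dependence per cell is as local as in this sketch, which is what makes the kernel a good GPU candidate.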

RAMSES AMR Mesh
- Fully Threaded Tree with Cartesian mesh
- Cell-by-cell refinement
- Complex data structure
- Irregular memory distribution
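A hedged sketch of what a fully threaded tree cell carries, with illustrative field names (RAMSES's real data layout is different, so this is only a cartoon): each cell keeps links to its parent, children and neighbours, which is why memory access is irregular.

program ftt_cell_sketch
  implicit none
  ! Cartoon of a fully-threaded-tree cell (illustrative, not RAMSES's actual
  ! data layout): every cell stores links to its parent, children and
  ! neighbours as indices into one flat array of cells.
  type :: amr_cell
     integer :: level                 ! refinement level of the cell
     integer :: parent                ! index of the parent cell (0 for a root cell)
     integer :: children(8)           ! indices of the child cells (0 if leaf)
     integer :: neighbours(6)         ! indices of the face neighbours
     real(8) :: u(6)                  ! conserved hydro variables
  end type amr_cell

  type(amr_cell), allocatable :: grid(:)   ! cells of all levels end up mixed here
  allocate(grid(1000))
  grid(:)%level  = 1                       ! dummy initialization
  grid(:)%parent = 0
end program ftt_cell_sketch

Walking such links produces the unpredictable memory positions described on the next slide.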

RAMSES memory management
- Non-contiguous memory
- Different levels of refinement are mixed
- A cell's position in memory is unpredictable

RAMSES hydro kernel & AMR
[Chart: in the original code, the hydro kernel splits into AMR grid handling, about 22% of the total runtime, and the equations solver, about 11%.]

Ramses on the GPU

RAMSES: GPU hydro solver
Profiling of the original code and of the OpenACC port:

Original code (100.0% = 8775.08 s USER):
  17.2%  1510.80  godfine1_
  14.6%  1277.92  get3cubefather_
   8.1%   712.22  gauss_seidel_mg_fine_
   6.1%   534.16  interpolate_and_correct_fine_
   6.0%   523.05  make_virtual_fine_dp_
   5.5%   485.88  make_virtual_reverse_dp_
   4.5%   393.04  cmp_residual_mg_fine_
   3.5%   305.05  interpol_phi_
   3.4%   295.69  interpol_hydro_
   3.0%   261.75  unsplit_
   2.7%   238.22  cmpflxm_
   2.1%   182.12  ctoprim_
   2.0%   176.51  build_parent_comms_mg_
   2.0%   174.41  gauss_seidel_mg_coarse_
   1.5%   133.62  riemann_llf_
   1.4%   125.39  synchro_hydro_fine_
   1.3%   111.62  restrict_residual_fine_reverse_
   1.3%   110.80  uslope_
   1.1%    94.46  getnborfather_
   1.1%    93.61  interpolate_and_correct_coarse_
   1.1%    92.37  make_virtual_mg_dp_
   1.0%    90.74  get3cubefather_godfine_
   1.0%    85.24  make_fine_bc_rhs_

OpenACC port (100.0% = 5718.25 s USER):
  22.3%  1276.68  get3cubefather_
  12.8%   732.49  gauss_seidel_mg_fine_
   9.4%   536.28  interpolate_and_correct_fine_
   9.1%   522.83  make_virtual_fine_dp_
   6.9%   394.13  cmp_residual_mg_fine_
   5.3%   304.07  interpol_phi_
   3.1%   176.42  gauss_seidel_mg_coarse_
   3.1%   176.04  build_parent_comms_mg_
   2.3%   129.44  make_virtual_reverse_dp_
   2.2%   125.99  synchro_hydro_fine_
   2.0%   111.68  restrict_residual_fine_reverse_
   1.7%    95.79  interpolate_and_correct_coarse_
   1.6%    92.44  make_virtual_mg_dp_
   1.5%    87.12  make_fine_bc_rhs_
   1.5%    83.87  cic_cell_
   1.0%    56.13  gradient_phi_
   1.0%    54.57  courant_fine_

33% vs 3%: the hydro routines, which account for roughly 33% of the time in the original CPU profile, drop to roughly 3% after the OpenACC port, since that work has moved to the GPU and largely disappears from the CPU profile.

Our development/testing/target system: Piz Daint
- Cray XC30 system at CSCS (No. 6 in the Top500)
- Nodes: 5272, each with an 8-core Intel Sandy Bridge CPU, 32 GB of DDR3 memory and one NVIDIA Tesla K20X GPU with 6 GB of GDDR5 memory
- Overall system: 42176 cores and 5272 GPUs, 170+32 TB of memory
- Interconnect: Aries routing and communications ASIC with Dragonfly network topology
- Peak performance: 7.787 petaflops
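As a quick consistency check (my arithmetic, not from the slide), the aggregate figures follow from the per-node numbers:

  5272 \times 8 = 42176 \ \text{cores}, \qquad 5272 \times 32\ \mathrm{GB} \approx 169\ \mathrm{TB\ (DDR3)}, \qquad 5272 \times 6\ \mathrm{GB} \approx 32\ \mathrm{TB\ (GDDR5)},

which matches the quoted 42176 cores and 170+32 TB.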

The programming model: OpenACC
- Directive-based API (the counterpart of OpenMP for parallel programming on accelerators)
- OpenACC (http://www.openacc-standard.org/)
- Supported by Cray and PGI (slightly different implementations, but converging)
- Finally converging (hopefully) with OpenMP
- Easier code development; supports incremental development
- Well suited to Fortran
- Performance tuning is not so easy (some performance may be sacrificed; the goal is 80% of CUDA)
- Can be combined with CUDA code
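A minimal, illustrative OpenACC sketch in Fortran (a toy loop, not RAMSES code): a single directive asks the compiler to generate a GPU kernel and handle the data movement for the loop.

program acc_example
  implicit none
  integer, parameter :: n = 1000000
  real(8) :: a(n), b(n)
  integer :: i

  b = 1.0d0

  ! One directive offloads the loop: b is copied in, a is copied back out.
  !$acc parallel loop copyin(b) copyout(a)
  do i = 1, n
     a(i) = 2.0d0 * b(i)
  end do
  !$acc end parallel loop

  print *, 'a(1) =', a(1)
end program acc_example

Built without OpenACC support, the same source simply runs the loop on the CPU, which is what makes the incremental porting approach possible.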

Moving data to/from the GPU
- Data must be sent to the GPU, but the AMR grid's data is stored randomly in memory.
- A pack-unpack strategy is therefore used, level by level: the hydro variables are packed into contiguous buffers in CPU memory, sent to the GPU, and unpacked there.
- This has to be done at every time step.
[Diagram: packing of the hydro variables in CPU memory, transfer to GPU memory, unpacking; a cost of about 1.1% of the runtime.]
A sketch of the idea follows below.
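A hedged sketch of the pack-and-send idea, with made-up names and sizes (none of these arrays or routines are RAMSES's own): cells of one AMR level are gathered into a contiguous buffer, the buffer is moved to the GPU with an OpenACC data region, and the results are scattered back.

program pack_and_send_sketch
  implicit none
  integer, parameter :: nvar = 6, ntotal = 10000, ncell = 2500
  real(8) :: uold(nvar, ntotal)        ! hydro variables, scattered in memory
  real(8) :: ubuf(nvar, ncell)         ! contiguous buffer for one AMR level
  integer :: index_of_cell(ncell)      ! indices of the cells of this level
  integer :: i

  uold = 1.0d0
  do i = 1, ncell
     index_of_cell(i) = 4*i            ! some non-contiguous selection of cells
  end do

  ! Pack on the CPU: gather the scattered cells into contiguous memory.
  do i = 1, ncell
     ubuf(:, i) = uold(:, index_of_cell(i))
  end do

  ! Copy the packed buffer to the GPU once per time step; kernels launched
  ! inside the data region reuse it without further host-device transfers.
  !$acc data copy(ubuf)
  !$acc parallel loop
  do i = 1, ncell
     ubuf(1, i) = 2.0d0 * ubuf(1, i)   ! placeholder for the hydro kernel
  end do
  !$acc end data

  ! Unpack: scatter the updated buffer back into the original structure.
  do i = 1, ncell
     uold(:, index_of_cell(i)) = ubuf(:, i)
  end do
end program pack_and_send_sketch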

On board the GPU
1. Memory is reorganized into spatially contiguous patches, so that the work can be easily split into blocks and coalesced memory access can be exploited.
2. Patches are grouped and pushed to the GPU cores; the group size can be tuned to improve occupancy.
3. The patch build-up strongly benefits from the high memory bandwidth.
4. Nested loops are collapsed wherever possible.
5. Gang- and vector-based work scheduling is adopted (no particular benefit was found in using worker scheduling); see the sketch below.
6. Data are offloaded only when and where necessary (but this can still be improved; ongoing work).
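A minimal sketch of the scheduling pattern in points 4 and 5, assuming illustrative loop bounds and array names (not RAMSES source): the patch and cell loops are collapsed into one iteration space and distributed over gangs and vector lanes.

program patch_update_sketch
  implicit none
  integer, parameter :: npatch = 4096, ncell_per_patch = 64
  real(8) :: q(ncell_per_patch, npatch)
  integer :: ip, ic

  q = 1.0d0

  ! Collapse the two nested loops and share the iterations among gangs
  ! and vector lanes; no worker level is used.
  !$acc parallel loop gang vector collapse(2) copy(q)
  do ip = 1, npatch
     do ic = 1, ncell_per_patch
        q(ic, ip) = 1.01d0 * q(ic, ip)   ! placeholder for the per-cell update
     end do
  end do
end program patch_update_sketch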

GPU implementation
[Chart: after the port, the hydro kernel's share of the total runtime drops to about 1.1% for the AMR grid handling and 0.8% for the equations solver.]

Performance analysis
Cosmological test with 3 levels of refinement (levels 6 to 8).

Original code (N = number of cores):
  run             T_tot (s)   T_hydro (s)   T_hydro/T_tot
  orig_v10_n1      155662      56218         36.1%
  orig_v10_n2       75905      27625         36.4%
  orig_v10_n4       36147      13207         36.5%
  orig_v10_n8       17755       6243         35.2%
  orig_v10_n16       8775       2918         33.3%

OpenACC port (N = number of nodes, one GPU per node):
  run                T_tot (s)  T_hydro (s)  T_hydro/T_tot  T_god_fine (s)  T_copy (s)  T_tot speedup  T_hydro speedup (1 core vs 1 GPU)  T_hydro speedup (1 CPU vs 1 GPU)  T_god/T_copy
  ACCyes_C1000_N1     104811     3009         2.9%            2270            739         1.49           18.68                               2.07                              3.07
  ACCyes_C1000_N2      49718     1425         2.9%            1040            385         1.53           19.39                               2.05                              2.70
  ACCyes_C1000_N4      23372      693         3.0%             485            208         1.55           19.07                               -                                 2.33
  ACCyes_C1000_N8      11543      344         3.0%             231            113         1.54           18.15                               -                                 2.03
  ACCyes_C1000_N16      5718      179         3.1%             115             64         1.53           16.26                               -                                 1.79

The speedup columns compare the original and OpenACC runs with the same N: "1 core vs 1 GPU" compares one original core with one GPU-accelerated node, while "1 CPU vs 1 GPU" compares a full 8-core node of the original code with one GPU-accelerated node.
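Assuming the speedup columns are simple ratios of the corresponding wall-clock times (which is consistent with the numbers in the table), the single-node row reads:

  \frac{T_{\mathrm{tot}}^{\mathrm{orig}}}{T_{\mathrm{tot}}^{\mathrm{ACC}}} = \frac{155662}{104811} \approx 1.49, \qquad
  \frac{T_{\mathrm{hydro}}^{\mathrm{orig}}}{T_{\mathrm{hydro}}^{\mathrm{ACC}}} = \frac{56218}{3009} \approx 18.7, \qquad
  \frac{T_{\mathrm{god}}}{T_{\mathrm{copy}}} = \frac{2270}{739} \approx 3.07 .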

Performance results
[Plot: strong scaling of the hydro time, original code vs OpenACC port, on 1 to 16 cores/nodes; time axis in seconds, log scale.]

Performance results
[Plot: strong scaling of the total time, original code vs OpenACC port, on 1 to 16 cores/nodes; time axis in seconds, log scale.]

Performance results: hydro variables + AMR variables
[The timing table shown here is the same as the one under Performance analysis above.]

Performance results
[Plot: scaling of the copy time for the hydro variables and for the AMR variables, on 1 to 16 nodes; time axis in seconds, log scale.]

A small simulation
Small cosmological simulation with hydro, gravity and cooling: box size = 100 Mpc (10^19 km), memory 3 GB.

  Level        Level id   Eff. mesh size   Spatial resolution (Mpc)
  Base level   7          256^3            0.39
  Eff. level   12         8192^3           0.012
  Max level    15         65536^3          0.0015

Visualization made with Splotch (https://github.com/splotchviz/splotch), S4516 @ GTC (by Mel Krokos).
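The spatial resolutions in the table follow from the box size divided by the effective mesh size at each level:

  \Delta x = \frac{L_{\mathrm{box}}}{N_{\mathrm{eff}}}: \qquad \frac{100\ \mathrm{Mpc}}{256} \approx 0.39\ \mathrm{Mpc}, \qquad \frac{100\ \mathrm{Mpc}}{8192} \approx 0.012\ \mathrm{Mpc}, \qquad \frac{100\ \mathrm{Mpc}}{65536} \approx 0.0015\ \mathrm{Mpc}.

The same relation reproduces the resolutions of the big simulation on the later slide.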

Results
[Plots: fraction of time saved using the GPU; scalability of the CPU and GPU versions (total time); scalability of the CPU and GPU versions (hydro time).]

... and a big simulation
Big cosmological simulation with hydro, gravity and cooling: box size = 100 Mpc (10^19 km), memory = 240 GB.

  Level        Level id   Eff. mesh size   Spatial resolution (Mpc)
  Base level   9          1024^3           0.09765625
  Eff. level   14         32768^3          0.003051758
  Max level    17         262144^3         0.00038147

AMR structure at timestep 100: 180000544 cells, i.e. 703127 cells per node on 256 nodes and 351563 cells per node on 512 nodes.

Results

HYDRO time
  Computing devices (nodes)   CPU (sec.)   GPU (sec.)
  256                          80.72        10.46
  512                          39.81         5.16

TOTAL time
  Computing devices (nodes)   CPU (sec.)   GPU (sec.)
  256                         701.18       590.35
  512                         423.12       358.75
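Reading speedups off these times (my ratios; the slide gives only the raw timings):

  \text{hydro:}\ \frac{80.72}{10.46} \approx 7.7\times\ (256\ \text{nodes}), \quad \frac{39.81}{5.16} \approx 7.7\times\ (512\ \text{nodes}); \qquad
  \text{total:}\ \frac{701.18}{590.35} \approx 1.19\times, \quad \frac{423.12}{358.75} \approx 1.18\times .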

Summary
Objective: enabling RAMSES on GPUs.
Methodology: incremental approach, exploiting RAMSES's modular architecture and the OpenACC programming model.
Current achievement: the hydro kernel has been ported to the GPU; final optimization is being completed.
Coming steps:
- Enable the cooling and radiative transfer module on the GPU
- Enable the MHD module on the GPU
- Move the MPI communication of the hydro variables to the GPU
Challenges:
- Enable the gravitational solver on the GPU
- Redesign the data structures