Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics

Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics
H. Y. Schive (薛熙于), T. Chiueh (闕志鴻), Y. C. Tsai (蔡御之)
Graduate Institute of Physics, National Taiwan University
Leung Center for Cosmology and Particle Astrophysics (LeCosPA)
Workshop on GPU Supercomputing (1/16/2009)

GPU Applications: from the smallest scales (QCD, quantum spin systems) to the largest scales (astrophysics & cosmology)

Outline
- Introduction
- GraCCA (Graphic-Card Cluster for Astrophysics): system and previous work
- AMR hydrodynamics + self-gravity simulation in GPUs
- Conclusion and future work

Introduction: GPU vs. CPU
Faster, faster, faster!!!
- GPU: low clock rate, many processors
  - GTX 280: 1.30 GHz, 240 processors (30 multiprocessors, each with 16 KB of fast shared memory)
  - ~933 GFLOPS
- CPU: high clock rate, few processors
  - Intel Core 2 Quad Q9300: 2.5 GHz, quad-core
  - ~40 GFLOPS
- The GPU is ~23 times faster in peak throughput

Programming Interface: CUDA (Compute Unified Device Architecture)
- The GPU acts as a multithreaded coprocessor to the CPU
- It executes thousands of threads in parallel
- All threads execute the same kernel
[Figure: the threads (1) to (N) of one kernel mapped onto the processors (1) to (128) of the GPU]
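To make this model concrete, here is a minimal sketch (illustrative, not taken from the talk) of a kernel and its launch: every thread runs the same function and selects its own data element from its block and thread indices.

```cuda
// Minimal sketch of the CUDA execution model: every thread executes the
// same kernel and picks its own element from its global thread index.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the last block
        data[i] *= factor;
}

// host side: launch thousands of threads in one call, e.g. for n elements
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
```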

GraCCA Graphic-Card Cluster for Astrophysics

Architecture
- 18 nodes, 36 GPUs
- Theoretical performance: 518.4 GFLOPS x 36 = 18.7 TFLOPS
- Network: gigabit Ethernet

Hardware in each node:

  Hardware      Model                          Amount
  Graphic card  NVIDIA GeForce 8800 GTX        2
  Motherboard   Gigabyte GA-M59SLI-S5          1
  CPU           AMD Athlon 64 X2 3800+         1
  Power supply  Thermaltake Toughpower 750 W   1
  RAM           DDR2-667 2 GB                  4
  Hard disk     Seagate 80 GB SATA II          1

[Figure: system architecture. Each of the 18 nodes holds one CPU, 2 GB of DDR2-667 PC memory, a gigabit network card, and two graphics cards, each with a G80 GPU and 768 MB of GDDR3 memory on a PCI-Express x16 slot; the nodes are connected through a gigabit network switch.]

[Photos of GraCCA: multi-node and single-node views]

Previous Work: Parallel Direct N-body Simulation (Schive et al. 2008, NewA 13, 418)
- 250x speed-up over a single CPU
- For N = 1024k: single GPU: 257 GFLOPS; 32 GPUs: 6.62 TFLOPS
[Figure: speed (GFLOPS) vs. particle number N for Ngpu = 1, 2, 4, 8, 16, 32]

Core Collapse in Globular Cluster
- Initial condition: Plummer's model
- The N = 64k case took about one month
- One of only a few groups with the computational capability to simulate core collapse for N = 64k
[Figure: log(core density) vs. scaled N-body time for N = 8K, 16K, 32K, 64K]

AMR Hydrodynamics Simulation in GPUs

PDE in Hydrodynamics
Conservation laws of mass, momentum, and energy, with self-gravity source terms:

$$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \vec{v}) = 0$$

$$\frac{\partial (\rho \vec{v})}{\partial t} + \nabla \cdot (\rho \vec{v} \vec{v} + P) = -\rho \nabla \phi$$

$$\frac{\partial E}{\partial t} + \nabla \cdot \left[ \vec{v} \, (E + P) \right] = -\rho \, \vec{v} \cdot \nabla \phi$$

where ρ is the density, v the velocity, P the pressure, φ the gravitational potential, and E the energy density.

Adaptive Mesh Refinement
- Boring region (flat, empty, low error): coarse mesh
- Interesting region (high density, high contrast, high error): fine mesh
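One common way to decide which regions are "interesting" is to flag cells whose local density contrast exceeds a threshold. The talk does not spell out its refinement criterion, so the kernel below is a hypothetical sketch of that idea (1-D for brevity; flagRefine and all names are assumptions):

```cuda
// Hypothetical refinement criterion (the talk does not give its actual one):
// flag a cell when the relative density jump to either neighbor is large.
__global__ void flagRefine(const float *rho, int *flag, int n, float thresh)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= 0 || i >= n - 1) return;            // skip domain boundary cells
    float contrast = fmaxf(fabsf(rho[i + 1] - rho[i]),
                           fabsf(rho[i] - rho[i - 1]))
                   / fmaxf(rho[i], 1e-20f);      // guard against rho == 0
    flag[i] = (contrast > thresh);               // 1 => refine this cell
}
```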

Example: Sedov-Taylor Blast Wave
- Density: spherical shock, compression ratio ~ 3.5
- 3 refinement levels (128³ → 512³)
[Figure: density slice with refinement levels 0, 1, 2 marked]

[Figure: Sedov-Taylor blast wave, density evolution]

Basic Scheme
- 2nd-order TVD scheme for the fluid solver
- SOR method for the Poisson solver
- Hierarchical oct-tree data structure
- Basic unit: patch (a fixed number of grids); a possible layout is sketched below
[Figure: nested patches at levels 0, 1, and 2, each drawn as a 2x2 grid]
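The patch hierarchy might be represented along the following lines. This is a sketch under assumed names and layout; the talk states the design (fixed-size patches in an oct-tree) but shows no code:

```cuda
// Sketch of an oct-tree AMR hierarchy built from fixed-size patches.
// All field names and the patch size are illustrative assumptions.
#define PATCH 8   // cells per side in every patch, at every level

struct Patch {
    float  rho[PATCH][PATCH][PATCH];   // fluid variables (density shown)
    float  pot[PATCH][PATCH][PATCH];   // gravitational potential
    int    level;                      // refinement level (0 = coarsest)
    Patch *parent;                     // covering coarse patch, or NULL
    Patch *child[8];                   // eight half-size children (3-D
                                       // oct-tree), or NULL if unrefined
    Patch *sibling[6];                 // same-level neighbors, used when
                                       // filling ghost zones
};
```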

GPU Acceleration
Two main tasks in the AMR program:
1. Patch construction (decision making, interpolation, complex data structure, data assignment): complicated, but takes little time → CPU
2. 3-D hydrodynamics + Poisson solver: straightforward, but time-consuming → GPU, fed with hundreds of patches simultaneously (see the sketch after the next figure)

Parallel Evaluation of Multiple Patches in a Single GPU
[Figure: patches from levels 0, 1, and 2 distributed across multiprocessors (1) to (16) of the GPU]
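In code, feeding the GPU hundreds of patches at once can be as simple as making the grid dimension equal to the number of patches, so each thread block works on its own patch and all multiprocessors stay busy regardless of AMR level. A minimal sketch (hydroKernel and its placeholder update are assumptions, not the talk's solver):

```cuda
// Sketch: evaluate a whole batch of equal-sized patches in one launch.
// One thread block per patch, one thread per cell.
#define N_CELL (8 * 8 * 8)            // cells per patch

__global__ void hydroKernel(float *u, float dt)
{
    // blockIdx.x selects the patch, threadIdx.x the cell within it
    int cell = blockIdx.x * N_CELL + threadIdx.x;
    u[cell] += dt * 0.0f;             // placeholder for the real TVD update
}

// host side: one launch covers hundreds of patches at once
// hydroKernel<<<numPatches, N_CELL>>>(d_patches, dt);
```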

Concurrent Execution in CPU and GPU
- Preparing data for the GPU fluid solver (data copy, interpolation, ...) is also very time-consuming!!
- Hide this preparation time with asynchronous execution on the GPU
[Timeline: while the GPU evaluates patch 1, the CPU prepares patch 2; while the GPU evaluates patch 2, the CPU prepares patch 3; ...]

Concurrent Memory Copy and Kernel Execution
- The bandwidth between CPU and GPU is only ~4 GB/s: just not high enough!!!
- Hide this data-transfer time with concurrent memory copy (between CPU and GPU) and kernel execution on the GPU, as sketched below
[Timeline: while the GPU evaluates patch 1, patch 2 is transferred over the 16x PCI-E bus; while the GPU evaluates patch 2, patch 3 is transferred; ...]
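Both overlap tricks can be expressed with CUDA streams. The following is an illustrative sketch, not the talk's code: it reuses hydroKernel and N_CELL from the batching sketch above, alternates patches between two streams so a copy in one stream can overlap a kernel in the other, and lets the CPU prepare the next patch in the meantime (prepare is an assumed callback; async copies require page-locked host buffers).

```cuda
#include <cuda_runtime.h>

// Sketch: overlap CPU preparation, host<->device copies, and kernel
// execution by cycling patches through two CUDA streams.
void evolvePatches(float *h_patch[], float *d_patch[], size_t bytes,
                   int numPatches, float dt, void (*prepare)(int))
{
    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    for (int p = 0; p < numPatches; p++) {
        cudaStream_t s = stream[p % 2];   // alternate between two streams
        // async H->D copy; can overlap the kernel in the other stream
        cudaMemcpyAsync(d_patch[p], h_patch[p], bytes,
                        cudaMemcpyHostToDevice, s);
        hydroKernel<<<1, N_CELL, 0, s>>>(d_patch[p], dt);
        cudaMemcpyAsync(h_patch[p], d_patch[p], bytes,
                        cudaMemcpyDeviceToHost, s);
        // meanwhile the CPU prepares the next patch (copy, interpolation, ...)
        if (p + 1 < numPatches) prepare(p + 1);
    }
    cudaStreamSynchronize(stream[0]);
    cudaStreamSynchronize(stream[1]);
    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
}
```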

Performance (hydrodynamics only)
- Single GPU vs. single CPU at 64³, 128³, 256³, 512³
- Up to 12.3x speed-up
[Figure: speed-up ratio vs. simulation size]

Poisson Solver in GPU
- Successive over-relaxation (SOR) method: given the boundary condition, SOR iteratively approaches the solution of the Poisson equation
- A patch of 8³ grids fits perfectly into the GPU's shared memory (16 KB per multiprocessor on the GeForce 8800 GTX), so data only needs to move between global and shared memory once before and once after the iteration loop
- The more iterations, the higher the performance: the fixed transfer cost is amortized over more work done in fast shared memory (see the sketch below)
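A minimal sketch of such a kernel, assuming red-black ordering, one thread block per patch, one ghost layer holding the boundary condition, and a right-hand side that already contains the 4πGρ·dx² source term; none of these details or names come from the talk itself:

```cuda
#define P  8                // interior cells per side of a patch
#define W (P + 2)           // including one ghost layer (boundary values)

__device__ __forceinline__ int idx(int x, int y, int z)
{
    return (z * W + y) * W + x;
}

// launch: SOR_patch<<<numPatches, dim3(P, P, P)>>>(pot, rhs, 1.4f, nIter);
__global__ void SOR_patch(float *pot, const float *rhs,
                          float omega, int nIter)
{
    __shared__ float phi[W * W * W];            // 4 KB: fits easily in 16 KB

    const int base = blockIdx.x * W * W * W;    // one block per patch
    const int tid  = (threadIdx.z * P + threadIdx.y) * P + threadIdx.x;

    // cooperative load of the whole patch, ghost zones included
    for (int i = tid; i < W * W * W; i += P * P * P)
        phi[i] = pot[base + i];
    __syncthreads();

    // this thread's interior cell (+1 skips the ghost layer)
    const int x = threadIdx.x + 1, y = threadIdx.y + 1, z = threadIdx.z + 1;
    const int   me    = idx(x, y, z);
    const int   color = (x + y + z) & 1;        // red-black ordering
    const float b     = rhs[base + me];         // assumed: 4*pi*G*rho*dx^2

    for (int it = 0; it < nIter; it++) {
        for (int c = 0; c < 2; c++) {           // red sweep, then black sweep
            if (color == c) {
                float nb = phi[idx(x-1,y,z)] + phi[idx(x+1,y,z)]
                         + phi[idx(x,y-1,z)] + phi[idx(x,y+1,z)]
                         + phi[idx(x,y,z-1)] + phi[idx(x,y,z+1)];
                phi[me] = (1.0f - omega) * phi[me] + omega * (nb - b) / 6.0f;
            }
            __syncthreads();                    // every thread reaches this
        }
    }

    // single write-back of the interior: shared -> global memory
    pot[base + me] = phi[me];
}
```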

Performance of the SOR in GPU
- Single GPU vs. single CPU: 17.5x speed-up at ~40 iterations
[Figure: speed-up ratio vs. number of iterations (10 to 1000)]

Multi-GPUs
- Each CPU/GPU pair handles a sub-domain
- Data are exchanged via MPI (a sketch follows)
[Figure: four CPU+GPU pairs (CPU 0/GPU 0 through CPU 3/GPU 3) exchanging sub-domain boundary data over the gigabit network]
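The per-step boundary exchange might look like the routine below: only a thin boundary face travels over the network, never a full sub-domain. This is a sketch under an assumed 1-D domain decomposition; the talk only states that MPI is used:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Sketch: swap one boundary face with the neighboring nodes each step.
// d_sendFace / d_ghostFace point at contiguous face buffers on the GPU;
// h_send / h_recv are host staging buffers (all names are assumptions).
void exchangeBoundaries(float *d_sendFace, float *d_ghostFace,
                        float *h_send, float *h_recv, int faceCount,
                        int leftRank, int rightRank)
{
    size_t bytes = faceCount * sizeof(float);

    // 1. our boundary face: GPU -> host staging buffer
    cudaMemcpy(h_send, d_sendFace, bytes, cudaMemcpyDeviceToHost);

    // 2. exchange faces with the neighbors over the gigabit network
    MPI_Sendrecv(h_send, faceCount, MPI_FLOAT, rightRank, 0,
                 h_recv, faceCount, MPI_FLOAT, leftRank,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // 3. received ghost layer: host -> GPU
    cudaMemcpy(d_ghostFace, h_recv, bytes, cudaMemcpyHostToDevice);
}
```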

Network Bandwidth
- The computation is highly improved, but the communication is NOT!!
- Gigabit Ethernet bandwidth: only ~128 MB/s
- We must minimize the amount of data to be transferred!!!
[Figure: possible directions of data transfer between sub-domains]

Performance (multi GPUs)
- 512³ run, 8 GPUs vs. 8 CPUs: 10.0x speed-up
- 1024³ run, 8 GPUs vs. 8 CPUs: 9.5x speed-up
[Figure: measured vs. ideal speed-up ratio as a function of the number of GPUs, 512³ run]

Demo: Kelvin-Helmholtz Instability

Performance in the State-of-the-Art GPU
Performance in the GTX 280 GPU (compared with the GeForce 8800 GTX):
- Hydrodynamics solver: 1192 ms → 638 ms
- Poisson solver: 336 ms → 154 ms
- The performance is further improved by roughly a factor of 2
- But the speed-up ratio of an upgraded GPU over an upgraded CPU remains about the same

Conclusion and Future Work
Conclusion:
- Parallel GPU-accelerated AMR hydrodynamics program
  - 1 GPU vs. 1 CPU: 12.3x speed-up
  - 8 GPUs vs. 8 CPUs: 10.0x speed-up
- GPU-accelerated Poisson solver: 17.5x speed-up at 40 iterations
Future work:
- Complete the Poisson solver
- Dark matter particles
- Load balance
- MHD
- Optimization on the latest GPUs (GTX 280, Tesla S1070)