GAMER: a GPU-accelerated Adaptive-MEsh-Refinement Code for Astrophysics (Applications and Performance of GPUs with Adaptive Meshes in Astrophysical Simulations)

GAMER: a GPU-accelerated Adaptive-MEsh-Refinement Code for Astrophysics (Applications and Performance of GPUs with Adaptive Meshes in Astrophysical Simulations). Hsi-Yu Schive (薛熙于), Tzihong Chiueh (闕志鴻), Yu-Chih Tsai (蔡御之), Ui-Han Zhang (張瑋瀚). Graduate Institute of Physics, National Taiwan University; Leung Center for Cosmology and Particle Astrophysics (LeCosPA). NVIDIA GTC (May 19, 2011)

Outline
- Introduction to the GPU (Graphics Processing Unit)
- Introduction to AMR (Adaptive Mesh Refinement)
- GPU + AMR: GAMER, the GPU-accelerated Adaptive-MEsh-Refinement Code for Astrophysics
- Optimization and Performance
- Applications

GPU: Graphics Processing Unit

Graphics Processing Unit (GPU): NVIDIA Quadro 6000. Typical uses: animations, video games, data visualization.

Graphics Processing Unit (GPU): NVIDIA Quadro 6000. Astrophysics??

Performance & Bandwidth. Peak performance: GPU vs. CPU ~ 10x. Bandwidth: GPU vs. CPU ~ 6x.

GPUs + Direct N-body. GraCCA system (2006): the Graphic-Card Cluster for Astrophysics, with 16 nodes and 32 GPUs (GeForce 8800 GTX); peak performance: 16.2 TFLOPS. Parallel direct N-body simulations on GraCCA use individual/shared time-steps and a 4th-order Hermite integrator, reaching 7.1 TFLOPS and a GPU/CPU speed-up of ~200. Ref: Schive, H.-Y., et al. 2008, NewA, 13, 418.

AMR: Adaptive Mesh Refinement

Uniform Mesh
Pros:
- Relatively easy to program
- Relatively easy to parallelize
Cons:
- Wastes computational time
- Wastes memory
- Lower resolution

Adaptive Mesh Refinement (AMR): the resolution adapts in both space and time, with flexible refinement criteria (e.g., density magnitude).

AMR Example: Kelvin-Helmholtz instability. Refinement criterion: vorticity magnitude |∇×V|. Base level: 128²; refined up to level 4; effective resolution: 2,048². [Figure: nested refinement layers 1 and 2.]
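To make the flagging step concrete, here is a minimal, hypothetical sketch (in C++, not GAMER's actual flagging routine) of a vorticity-magnitude refinement check for a 2D patch; the function name, array layout, and threshold are illustrative assumptions.

    #include <cmath>

    // Hypothetical check: flag a 2D patch for refinement if the vorticity
    // magnitude |dvy/dx - dvx/dy| exceeds a user-chosen threshold anywhere.
    bool needs_refinement(const float *vx, const float *vy,
                          int nx, int ny, float dx, float threshold)
    {
        for (int j = 1; j < ny - 1; j++)
        for (int i = 1; i < nx - 1; i++) {
            const int c = j * nx + i;                                // cell index
            const float dvy_dx = (vy[c + 1]  - vy[c - 1])  / (2.0f * dx);
            const float dvx_dy = (vx[c + nx] - vx[c - nx]) / (2.0f * dx);
            if (std::fabs(dvy_dx - dvx_dy) > threshold) return true; // flag patch
        }
        return false;
    }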

AMR Example (zoom-in): [figure showing the nested patches of refinement layers 1 and 2].

GAMER: the GPU-accelerated Adaptive-MEsh-Refinement Code for Astrophysics

AMR Scheme in GAMER. Refinement unit: the patch, a block containing a fixed number of cells (e.g., 8³). Supports GPU hydro and gravity solvers. Hierarchical oct-tree data structure. [Figure: nested patches at refinement levels 0, 1, and 2.]
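As an illustration of the oct-tree patch hierarchy described above, the following is a minimal C++ sketch of a fixed-size patch with parent/child links; the struct name, field names, and the NCOMP = 5 hydro variables are assumptions made for illustration, not GAMER's actual data structure.

    #define PATCH_SIZE 8   // cells per patch edge, giving 8^3 cells per patch
    #define NCOMP      5   // e.g., density, three momentum components, energy

    // Hypothetical oct-tree AMR patch: a fixed block of cells plus links to
    // its coarser parent and up to eight finer children.
    struct Patch {
        float  field[NCOMP][PATCH_SIZE][PATCH_SIZE][PATCH_SIZE]; // cell data
        int    level;        // refinement level (0 = base level)
        int    corner[3];    // integer coordinates of the patch corner
        Patch *parent;       // parent patch one level coarser (nullptr at level 0)
        Patch *children[8];  // child patches one level finer (nullptr if unrefined)
    };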

Example: Blast Wave Test. [Figure: the AMR patches of the blast-wave grid are distributed among several GPUs (GPU 1, GPU 2, GPU 3, ...), each of which contains multiple multiprocessors (1-16).]

Example: Blast Wave Test (continued). [Figure: each patch is assigned to one GPU multiprocessor, and the cells within a patch are processed by individual GPU threads (thread 1, 2, 3, 4, ...).]

Optimization

CPU vs. CPU + GPU: the dominant cost factors are the fluid and gravity solvers. [Bar chart of wall-clock time (s) per component: fluid solver 356.4 s on the CPU vs. 4.4 s on the GPU (81x); gravity solver 349.7 s vs. 4.6 s (76x); total 731.3 s vs. 30.4 s, a 24x overall speed-up.]

Optimization I: Asynchronous Memory Copy. Data transfer between CPU and GPU takes 27%-34% of the total GPU execution time!! Use CUDA streams to perform the memory copies concurrently with kernel execution. [Bar chart: time split into CPU->GPU copy, GPU kernel, and GPU->CPU copy for the fluid and gravity solvers; with streams the fluid-solver total drops from 226 to 173 (about 23%) and the gravity-solver total from 298 to 239 (about 20%).]
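The sketch below shows the general CUDA-streams pattern the slide refers to: host-to-device copies, kernel launches, and device-to-host copies are issued in different streams so that transfers and computation overlap. The kernel, buffer names, and chunk sizes are illustrative assumptions, not GAMER's actual solver.

    #include <cuda_runtime.h>

    // Placeholder kernel standing in for a GPU solver.
    __global__ void solve_fluid(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 0.5f * in[i];
    }

    int main()
    {
        const int NSTREAM = 4, CHUNK = 1 << 20, N = NSTREAM * CHUNK;
        float *h_in, *h_out, *d_in, *d_out;

        cudaMallocHost(&h_in,  N * sizeof(float));   // page-locked host memory is
        cudaMallocHost(&h_out, N * sizeof(float));   // required for async copies
        cudaMalloc    (&d_in,  N * sizeof(float));
        cudaMalloc    (&d_out, N * sizeof(float));
        for (int i = 0; i < N; i++) h_in[i] = 1.0f;  // dummy input data

        cudaStream_t stream[NSTREAM];
        for (int s = 0; s < NSTREAM; s++) cudaStreamCreate(&stream[s]);

        // Work issued in different streams may overlap: while one chunk is being
        // copied over PCI-E, another chunk's kernel can run on the GPU.
        for (int s = 0; s < NSTREAM; s++) {
            const int off = s * CHUNK;
            cudaMemcpyAsync(d_in + off, h_in + off, CHUNK * sizeof(float),
                            cudaMemcpyHostToDevice, stream[s]);
            solve_fluid<<<(CHUNK + 255) / 256, 256, 0, stream[s]>>>
                       (d_in + off, d_out + off, CHUNK);
            cudaMemcpyAsync(h_out + off, d_out + off, CHUNK * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[s]);
        }
        cudaDeviceSynchronize();   // wait for all streams to finish
        return 0;
    }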

Optimization I: Asynchronous Memory Copy (performance). [Bar chart of wall-clock time (s): with streams, the GPU fluid solver drops to 3.4 s (105x vs. the CPU) and the gravity solver to 3.5 s (100x); the total drops to 27.9 s, a 26x overall speed-up.]

Optimization II: OpenMP. Fully exploit the multi-core CPU computing power: N GPUs + K CPU cores (N ≤ K). [Diagram: without OpenMP, each of the N GPUs is driven by a single CPU core; with OpenMP, all K CPU cores share the CPU-side work alongside the N GPUs.]
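A minimal sketch of the thread-to-GPU mapping this slide describes, assuming K OpenMP threads sharing N GPUs (N ≤ K); the function name and loop body are hypothetical placeholders rather than GAMER's actual routines.

    #include <omp.h>
    #include <cuda_runtime.h>

    // Hypothetical sketch: K OpenMP threads on the host share N GPUs (N <= K).
    // Each thread selects a device and then takes patches from a shared loop,
    // so the CPU-side work runs on all K cores while GPU work is issued from
    // the threads bound to each device.
    void advance_level(int n_patch, int n_gpu)
    {
        #pragma omp parallel
        {
            const int tid = omp_get_thread_num();
            cudaSetDevice(tid % n_gpu);          // map this CPU thread to one GPU

            #pragma omp for schedule(dynamic)
            for (int p = 0; p < n_patch; p++) {
                // CPU work done by all K cores: fill ghost zones, build the
                // solver input arrays, apply refinement criteria (placeholders).

                // GPU work: asynchronously launch the hydro/gravity solver for
                // this patch on the device owned by this thread (placeholder).
            }
        }
    }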

Optimization II: OpenMP (performance). [Bar chart of wall-clock time (s): parallelizing the CPU-side work with OpenMP reduces the total from 27.9 s to 14.9 s, a 1.87x gain and a 49x overall speed-up; the GPU fluid and gravity solvers remain at 3.4 s (105x) and 3.5 s (100x).]

Optimization III: Concurrent Execution between CPU and GPU. Invoking GPU kernels and transferring data between CPU and GPU are asynchronous operations, so the CPU can keep working while the GPU computes!! [Bar chart of wall-clock time (s): overlapping CPU and GPU work reduces the total from 14.9 s to 10.3 s (2.71x faster than the 27.9 s baseline), a 71x overall speed-up.]
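The pattern the slide relies on looks roughly like the following sketch: a kernel launch returns immediately, the host does other work, and synchronization happens only when the GPU results are actually needed. The kernel and the CPU work here are placeholders, not GAMER code.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel standing in for a GPU solver.
    __global__ void gpu_solver(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;
    }

    // Placeholder for CPU-side work (e.g., preparing the next patch group,
    // performing grid refinement, or filling MPI buffers).
    void cpu_work()
    {
        double sum = 0.0;
        for (int i = 0; i < 1000000; i++) sum += 1.0e-6;
        std::printf("CPU work done (%f)\n", sum);
    }

    int main()
    {
        const int N = 1 << 20;
        float *d_x;
        cudaMalloc(&d_x, N * sizeof(float));
        cudaMemset(d_x, 0, N * sizeof(float));

        gpu_solver<<<(N + 255) / 256, 256>>>(d_x, N);  // returns immediately
        cpu_work();                                    // runs while the GPU computes
        cudaDeviceSynchronize();                       // block only when results are needed

        cudaFree(d_x);
        return 0;
    }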

Optimization IV: Space-filling Curve for Domain Decomposition. Rectangular domain decomposition can lead to load imbalance. [Figure: with AMR, some rectangular sub-domains carry much more load than others.]

Optimization IV: Space-filling Curve for Domain Decomposition. The standard space-filling-curve method can be applied to GAMER (not yet complete).
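For illustration only, the sketch below orders patches along a Morton (Z-order) space-filling curve and splits the ordered list evenly among MPI ranks; GAMER's actual scheme is not specified on the slide and may differ (e.g., a different curve or per-patch cost weights).

    #include <cstdint>

    // Interleave the bits of a 21-bit coordinate with two zero bits after each bit.
    static uint64_t spread_bits(uint32_t x)
    {
        uint64_t v = x & 0x1fffff;
        v = (v | v << 32) & 0x001f00000000ffffULL;
        v = (v | v << 16) & 0x001f0000ff0000ffULL;
        v = (v | v <<  8) & 0x100f00f00f00f00fULL;
        v = (v | v <<  4) & 0x10c30c30c30c30c3ULL;
        v = (v | v <<  2) & 0x1249249249249249ULL;
        return v;
    }

    // Morton (Z-order) index of a patch from its integer corner coordinates.
    uint64_t morton3d(uint32_t x, uint32_t y, uint32_t z)
    {
        return spread_bits(x) | (spread_bits(y) << 1) | (spread_bits(z) << 2);
    }

    // After sorting all patches by Morton index, give contiguous segments of the
    // curve to MPI ranks so each rank receives a similar number of patches.
    int rank_of_patch(uint64_t sorted_index, uint64_t n_patch, int n_rank)
    {
        return static_cast<int>(sorted_index * n_rank / n_patch);
    }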

Performance

Performance: Single GPU. NERSC Dirac GPU cluster; GPU: 1 NVIDIA Tesla C2050; CPU: 1 Intel Xeon E5530. Test with self-gravity (80x GPU speed-up in the solvers) and individual time-steps. Legend: Stream = PCI-E/GPU overlap; Async = CPU/GPU overlap; OMP(4) = 4 OpenMP threads. [Chart: the successive optimizations are annotated with speed-up factors of 1.11x, 1.38x, and 2.25x.] GAMER-optimized vs. 1 CPU core: 84x; vs. 4 CPU cores: 22x.

Performance: GPU Cluster. NERSC Dirac GPU cluster; GPUs: 1-32 NVIDIA Tesla C2050; CPUs: 1-32 Intel Xeon E5530. Test with self-gravity (80x GPU speed-up in the solvers) and individual time-steps. Legend: Stream = PCI-E/GPU overlap; Async = CPU/GPU overlap; OMP(4) = 4 OpenMP threads. 32 GPUs vs. 32 CPU cores: 71x; 32 GPUs vs. 128 CPU cores: 18x; 32 GPUs are thus equivalent to roughly 2,304 CPU cores. MPI communication takes ~11% of the total wall-clock time.

Applications

I: Large-scale Structure. 100 h⁻¹ Mpc comoving box; effective resolutions: 8,192³ and 32,768³; purely baryonic, with dark matter to be added; speed-up: ~70x.

II: Bosonic Dark Matter. Schrödinger equation with self-gravity, using GAMER as a GPU+AMR framework; 10 h⁻¹ Mpc comoving box; effective resolution: 32,768³; speed-up: ~40x.
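For context, "Schrödinger equation with self-gravity" refers to a Schrödinger-Poisson system; a minimal form, written here in physical coordinates and ignoring cosmic expansion (so not necessarily the exact comoving formulation used in this application), with wave function \psi, particle mass m, and gravitational potential V, is:

    i\hbar\,\frac{\partial \psi}{\partial t}
        = -\frac{\hbar^{2}}{2m}\,\nabla^{2}\psi + m\,V\,\psi,
    \qquad
    \nabla^{2} V = 4\pi G\, m\,|\psi|^{2}.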

Conclusion
GAMER: GPU-accelerated Adaptive-MEsh-Refinement Code for Astrophysics
- A general-purpose framework of AMR + GPUs
- Hybrid MPI/OpenMP/GPU parallelization (multiple CPUs + multiple GPUs)
- 70x ~ 100x speed-up (1 GPU vs. 1 CPU core)
Optimizations:
- Asynchronous memory copies
- Hybrid OpenMP/MPI parallelization
- Concurrent execution between CPU and GPU
- Space-filling curve for load balance
GAMER references: (1) Schive, H.-Y., et al. 2010, ApJS, 186, 457; (2) arXiv:1103.3373