GAMER : a GPU-accelerated Adaptive-MEsh-Refinement Code for Astrophysics (Applications and Performance of GPUs and Adaptive Meshes in Astrophysical Simulations)


1 GAMER : a GPU-accelerated Adaptive-MEsh-Refinement Code for Astrophysics (Applications and Performance of GPUs and Adaptive Meshes in Astrophysical Simulations). Hsi-Yu Schive (薛熙于), Tzihong Chiueh (闕志鴻), Yu-Chih Tsai (蔡御之), Ui-Han Zhang (張瑋瀚). Graduate Institute of Physics, National Taiwan University; Leung Center for Cosmology and Particle Astrophysics (LeCosPA). NVIDIA GTC (May 19, 2011)

2 Outline. Introduction to GPU (Graphic-Processing-Unit); Introduction to AMR (Adaptive-Mesh-Refinement); GPU + AMR: GAMER (GPU-accelerated Adaptive-MEsh-Refinement Code for Astrophysics); Optimization and Performance; Applications

3 GPU Graphic-Processing-Unit

4 Graphic-Processing-Unit (GPU). NVIDIA Quadro 6000. Animations, video games, data visualization

5 Graphic-Processing-Unit (GPU). NVIDIA Quadro 6000. Astrophysics?

6 Performance & Bandwidth. Performance: GPU vs. CPU ~ 10x. Bandwidth: GPU vs. CPU ~ 6x

7 GPUs + Direct N-body. GraCCA system (2006): Graphic-Card Cluster for Astrophysics; 16 nodes, 32 GPUs (GeForce 8800 GTX); peak performance: 16.2 TFLOPS. Parallel direct N-body simulation in GraCCA: individual/shared time-step; 4th-order Hermite integrator; 7.1 TFLOPS; GPU/CPU speed-up ~ 200. Ref: Schive, H-Y., et al. 2008, NewA, 13, 418

8 AMR Adaptive-Mesh-Refinement

9 Uniform Mesh. Pros: relatively easy to program; relatively easy to parallelize. Cons: wastes computational time; wastes memory; lower resolution

10 Adaptive-Mesh-Refinement (AMR). Resolution adaptively changes with space and time. Flexible refinement criteria (e.g., density magnitude)

11 AMR Example. Kelvin-Helmholtz instability. Refinement criterion: vorticity magnitude. Base level: 128^2; refined level: 4; 2,048^2 effectively. [Figure labels: V; Layer 2, Layer 1, Layer 2]
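The effective resolution quoted for this example follows from the base grid and the number of refinement levels. A short worked check, assuming each refinement level doubles the resolution per dimension (a refinement ratio of 2, which the slide does not state explicitly):

```latex
N_{\mathrm{eff}} = N_{\mathrm{base}} \times 2^{L} = 128 \times 2^{4} = 2{,}048
```

Under that assumption, a 128^2 base grid with 4 refined levels matches a 2,048^2 uniform grid in its finest regions.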

12 AMR Example. [Figure labels: Layer 2, Layer 1, Layer 2]

13 GAMER GPU-accelerated Adaptive-MEsh-Refinement Code for Astrophysics

14 AMR Scheme in GAMER. Refinement unit: patch (containing a fixed number of cells, e.g., 8^3). Supports GPU hydro and gravity solvers. Hierarchical oct-tree data structure. [Figure legend: patch at refinement level 0; patch at refinement level 1; patch at refinement level 2]
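A minimal sketch of how fixed-size patches could be organized in an oct-tree, as described on this slide. The names `Patch`, `refine`, `count_leaf_cells`, and `PATCH_SIZE` are hypothetical and chosen for illustration; this is not GAMER's actual data structure. It assumes each refinement doubles the resolution per dimension, so one parent patch splits into 8 children.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

PATCH_SIZE = 8  # cells per side, following the "8^3" example on the slide


@dataclass
class Patch:
    """One refinement unit: a fixed PATCH_SIZE^3 block of cells (hypothetical layout)."""
    level: int
    corner: Tuple[int, int, int]  # lower corner, in cell units of this patch's own level
    parent: Optional["Patch"] = field(default=None, repr=False, compare=False)
    children: List["Patch"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

    def refine(self) -> List["Patch"]:
        """Split this patch into 8 child patches at level + 1 (oct-tree refinement)."""
        if self.children:
            return self.children
        # The same region has twice as many cells per side at the next level.
        x, y, z = (2 * c for c in self.corner)
        for k in (0, 1):
            for j in (0, 1):
                for i in (0, 1):
                    child_corner = (x + i * PATCH_SIZE,
                                    y + j * PATCH_SIZE,
                                    z + k * PATCH_SIZE)
                    self.children.append(
                        Patch(self.level + 1, child_corner, parent=self))
        return self.children


def count_leaf_cells(patch: Patch) -> int:
    """Count cells in leaf patches only, i.e. the cells at the finest local resolution."""
    if patch.is_leaf():
        return PATCH_SIZE ** 3
    return sum(count_leaf_cells(child) for child in patch.children)
```

Because every patch holds the same number of cells regardless of level, a solver can treat patches as uniform work units, which is one plausible reason a patch-based layout maps well onto GPU solvers.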

15 Example : Blast Wave Test. [Figure labels: GPU 1, GPU 2, GPU 3, ...; Multiprocessor (1), Multiprocessor (2), Multiprocessor (3), ..., Multiprocessor (16)]

16 Example : Blast Wave Test. [Figure labels: Patch 1; GPU 1, GPU 2, GPU 3, ...; Multiprocessor (1), Multiprocessor (2), Multiprocessor (3), ..., Multiprocessor (16); Thread 1, Thread 2, Thread 3, Thread 4]

17 Optimization

18 CPU vs. CPU + GPU. Dominant factors: fluid and gravity solvers. [Figure: wall-clock time (s) for CPU and GPU; surviving labels: 731., 24x]

19 Optimization I : Asynchronous Memory Copy. Data transfer between CPU and GPU: 27% ~ 34% of the total GPU execution time! Use CUDA streams to perform memory copy concurrently with kernel execution. [Figure: percentage (%) of execution time for the fluid and gravity solvers; bars: CPU -> GPU, GPU kernel, GPU -> CPU, Total, Total with streams]
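A rough way to see how much overlapping copies with kernels can help is an idealized timing model. The sketch below is an assumption-laden estimate, not a measurement: it treats the transfer fraction as the combined CPU -> GPU and GPU -> CPU share of the serial execution time, assumes the copies are hidden completely behind kernel execution, and ignores pipeline start-up and drain. The function name `stream_speedup_bound` is hypothetical.

```python
def stream_speedup_bound(transfer_fraction: float) -> float:
    """Upper bound on the speed-up from hiding memory copies behind kernels.

    transfer_fraction: share of the serial (copy + kernel) time spent on
    CPU<->GPU copies, e.g. 0.27 for 27%.
    """
    if not 0.0 <= transfer_fraction < 1.0:
        raise ValueError("transfer_fraction must be in [0, 1)")
    kernel_fraction = 1.0 - transfer_fraction
    # With perfect overlap, the slower of the two activities sets the pace.
    overlapped = max(kernel_fraction, transfer_fraction)
    return 1.0 / overlapped


if __name__ == "__main__":
    for f in (0.27, 0.34):
        print(f"transfer share {f:.0%}: at most {stream_speedup_bound(f):.2f}x")
```

For the 27% and 34% transfer shares quoted on the slide, this bound comes out at roughly 1.37x and 1.52x; real gains would be lower whenever the overlap is incomplete.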

20 Optimization I : Asynchronous Memory Copy. [Figure: wall-clock time (s) for CPU and GPU; speed-up labels: (26x), (105x), (100x)]

21 Optimization II : OpenMP. Fully exploit the multi-core CPU computing power. N GPUs + K CPU cores (N ≤ K). [Figure labels: CPU, GPU, OpenMP; Core [1], Core [2], ..., Core [N], Core [K]]

22 Optimization II : OpenMP. Fully exploit the multi-core CPU computing power. N GPUs + K CPU cores (N ≤ K). [Figure: wall-clock time (s) for CPU, OpenMP, and GPU; surviving labels: 6.6, (105x), (100x), (49x)]

23 Optimization III : Concurrent Execution between CPU and GPU. Invoking GPU kernels and transferring data between CPU and GPU are asynchronous! [Figure: wall-clock time (s) for CPU, OpenMP, GPU, and Optimized; speed-up labels: (105x), (100x), (71x)]
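Since kernel launches and transfers return control to the CPU immediately, the CPU can prepare the next batch of patches while the GPU works on the current one. The sketch below illustrates only that control flow. It uses a single worker thread as a stand-in for the GPU, and the functions `prepare_on_cpu`, `solve_on_gpu`, and `run_pipelined` are hypothetical placeholders rather than GAMER routines. In plain Python the two stages will not truly run in parallel for CPU-bound work; the point is the ordering of the steps.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def prepare_on_cpu(group: List[float]) -> List[float]:
    """Hypothetical CPU stage, e.g. preparing input data for the solver."""
    return [x + 1.0 for x in group]


def solve_on_gpu(prepared: List[float]) -> float:
    """Hypothetical stand-in for an asynchronously launched GPU solver."""
    return sum(x * x for x in prepared)


def run_pipelined(groups: List[List[float]]) -> List[float]:
    """Process groups so that CPU preparation of group i+1 overlaps the 'GPU' work on group i."""
    results: List[float] = []
    if not groups:
        return results
    with ThreadPoolExecutor(max_workers=1) as gpu:
        # Launch the first group, then return to the CPU without waiting.
        pending = gpu.submit(solve_on_gpu, prepare_on_cpu(groups[0]))
        for group in groups[1:]:
            prepared = prepare_on_cpu(group)   # CPU works while the "GPU" is busy
            results.append(pending.result())   # synchronize with the previous launch
            pending = gpu.submit(solve_on_gpu, prepared)
        results.append(pending.result())
    return results
```

The results come back in the same order as a purely serial loop would produce them; only the waiting is rearranged.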

24 Optimization IV : Space-filling Curve for Domain Decomposition. Rectangular domain decomposition can lead to load imbalance. [Figure labels: More load, Less load]

25 Optimization IV : Space-filling Curve for Domain Decomposition. The standard space-filling curve method can be applied to GAMER (not yet complete)
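To make the idea concrete, here is a minimal sketch of space-filling-curve decomposition. The slide does not say which curve is meant, so a Morton (Z-order) curve is assumed here for simplicity; a Hilbert curve is a common alternative with better locality. The sketch also assumes every patch costs the same amount of work. The function names `morton_key` and `decompose` are hypothetical.

```python
from typing import List, Sequence, Tuple

Coord = Tuple[int, int, int]


def morton_key(x: int, y: int, z: int, bits: int = 10) -> int:
    """Interleave the bits of (x, y, z) into a single Z-order index."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (3 * b)
        key |= ((y >> b) & 1) << (3 * b + 1)
        key |= ((z >> b) & 1) << (3 * b + 2)
    return key


def decompose(patches: Sequence[Coord], n_ranks: int) -> List[List[Coord]]:
    """Sort patches along the curve and cut it into n_ranks nearly equal pieces."""
    ordered = sorted(patches, key=lambda p: morton_key(*p))
    base, extra = divmod(len(ordered), n_ranks)
    chunks: List[List[Coord]] = []
    start = 0
    for rank in range(n_ranks):
        size = base + (1 if rank < extra else 0)  # spread the remainder over the first ranks
        chunks.append(ordered[start:start + size])
        start += size
    return chunks
```

Because the cut is made along a one-dimensional ordering, each rank receives nearly the same number of patches however unevenly the refined regions are distributed in space, which is the load-balance issue the rectangular decomposition runs into.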

26 Performance

27 Performance : Single GPU. NERSC Dirac GPU Cluster. GPU: 1 NVIDIA Tesla C2050; CPU: 1 Intel Xeon E5530. With self-gravity (80x speed-up in GPU) and individual time-step. Stream : PCI-E/GPU overlap; Async : CPU/GPU overlap; OMP(4) : 4 OpenMP threads. GAMER-optimized vs. 1 CPU core: 84x; vs. 4 CPU cores: 22x. [Figure: surviving speed-up labels: 1.38x, 1.11x]


28 Performance : GPU Cluster. NERSC Dirac GPU Cluster. GPU: 1-32 NVIDIA Tesla C2050; CPU: 1-32 Intel Xeon E5530. With self-gravity (80x speed-up in GPU) and individual time-step. Stream : PCI-E/GPU overlap; Async : CPU/GPU overlap; OMP(4) : 4 OpenMP threads. 32 GPUs vs. 32 CPU cores: 71x. 32 GPUs vs. 128 CPU cores: 18x, equivalent to 2,304 CPU cores. MPI ~ 11% of T_total
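One reading of the "equivalent to 2,304 CPU cores" figure is the measured speed-up multiplied by the number of CPU cores in the comparison, assuming the CPU code would scale ideally to that many cores:

```latex
18 \times 128 = 2{,}304, \qquad 71 \times 32 = 2{,}272
```

The two comparisons agree to within the rounding of the quoted speed-ups.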

29 Applications

30 I : Large-scale Structure. 100 h^-1 Mpc comoving box. Effective resolution: 8,192^3 & 32,768^3. Purely baryonic; dark matter to be added. Speed-up: ~70x

31 II : Bosonic Dark Matter. Schrödinger eq. with self-gravity. Use GAMER as GPU+AMR framework. 10 h^-1 Mpc comoving box. Effective resolution: 32,768^3. Speed-up: ~40x
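For reference, a Schrödinger equation with self-gravity is commonly written as the Schrödinger-Poisson system below. This is the generic non-comoving form, given as a sketch: the slide does not show the exact equations or normalization solved with GAMER, so the symbols (wave function ψ, particle mass m, gravitational potential Φ, with |ψ|² taken as the number density) are assumptions.

```latex
i\hbar \frac{\partial \psi}{\partial t}
  = -\frac{\hbar^{2}}{2m}\nabla^{2}\psi + m\,\Phi\,\psi,
\qquad
\nabla^{2}\Phi = 4\pi G\, m\,|\psi|^{2}
```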

32 Conclusion. GAMER : GPU-accelerated Adaptive-MEsh-Refinement Code for Astrophysics. A general-purpose framework of AMR + GPUs. Hybrid MPI/OpenMP/GPU parallelization (multi CPUs + multi GPUs). 70x ~ 100x speed-up (1 GPU vs. 1 CPU core). Optimizations: asynchronous memory copies; hybrid OpenMP/MPI parallelization; concurrent execution between CPU and GPU; space-filling curve for load balance. GAMER ref : (1) Schive, H-Y., et al. 2010, ApJS, 186, 457; (2) arxiv:
