GAMER: a GPU-accelerated Adaptive-MEsh-Refinement Code for Astrophysics
(Applications and Performance of GPUs with Adaptive Mesh Refinement in Astrophysical Simulations)
Hsi-Yu Schive (薛熙于), Tzihong Chiueh (闕志鴻), Yu-Chih Tsai (蔡御之), Ui-Han Zhang (張瑋瀚)
Graduate Institute of Physics, National Taiwan University
Leung Center for Cosmology and Particle Astrophysics (LeCosPA)
NVIDIA GTC (May 19, 2011)
Outline
- Introduction to GPU (Graphics Processing Unit)
- Introduction to AMR (Adaptive Mesh Refinement)
- GPU + AMR: GAMER, the GPU-accelerated Adaptive-MEsh-Refinement Code for Astrophysics
- Optimization and Performance
- Applications
GPU: Graphics Processing Unit
Graphics Processing Unit (GPU)
[Image: NVIDIA Quadro 6000]
Animations, video games, data visualization
Graphics Processing Unit (GPU)
[Image: NVIDIA Quadro 6000]
... Astrophysics??
Performance & Bandwidth
- Performance: GPU vs. CPU ~ 10x
- Bandwidth: GPU vs. CPU ~ 6x
GPUs + Direct N-body
GraCCA system (2006): Graphic-Card Cluster for Astrophysics
- 16 nodes, 32 GPUs (GeForce 8800 GTX)
- Peak performance: 16.2 TFLOPS
Parallel direct N-body simulation in GraCCA
- Individual/shared time-steps
- 4th-order Hermite integrator
- Achieved performance: 7.1 TFLOPS
- GPU/CPU speed-up ~ 200x
Ref: Schive, H.-Y., et al. 2008, NewA, 13, 418
AMR Adaptive-Mesh-Refinement
Uniform Mesh
Pros:
- Relatively easy to program
- Relatively easy to parallelize
Cons:
- Wastes computational time
- Wastes memory
- Lower resolution
Adaptive-Mesh-Refinement (AMR)
- Resolution adapts in both space and time
- Flexible refinement criteria (e.g., density magnitude); a minimal sketch follows
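A refinement criterion of this kind reduces to a per-cell threshold test. Below is a minimal C++ sketch, assuming a density criterion and the 8³ patches introduced later; NeedRefine and RhoThreshold are illustrative names, not GAMER's interface:

    // Flag a patch for refinement if any cell exceeds a density threshold.
    // (Illustrative sketch; not GAMER's actual refinement routine.)
    bool NeedRefine( const float Rho[8][8][8], const float RhoThreshold )
    {
       for (int k=0; k<8; k++)
       for (int j=0; j<8; j++)
       for (int i=0; i<8; i++)
          if ( Rho[k][j][i] > RhoThreshold )
             return true;    // one dense cell is enough to flag the whole patch

       return false;
    }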
AMR Example: Kelvin-Helmholtz Instability
- Refinement criterion: vorticity magnitude
- Base level 128², 4 refinement levels: 2,048² effective resolution
[Figure: simulation snapshot with zoom-in panels Layer 1 and Layer 2]
AMR Example
[Figure: zoom-in views of Layer 1 and Layer 2]
GAMER GPU-accelerated Adaptive-MEsh-Refinement Code for Astrophysics
AMR Scheme in GAMER
- Refinement unit: patch (a fixed number of cells, e.g., 8³)
- Supports GPU hydro and gravity solvers
- Hierarchical oct-tree data structure (see the sketch below)
[Figure: patches at refinement levels 0, 1, and 2]
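A minimal C++ sketch of such an oct-tree patch, assuming 8³ cells and five fluid variables per cell; the member names are illustrative and do not reproduce GAMER's actual data structure:

    #define PATCH_SIZE 8    // cells per patch per dimension (8^3 per patch)

    struct Patch
    {
       // five fluid variables per cell: density, x/y/z momentum, energy
       float  Fluid[5][PATCH_SIZE][PATCH_SIZE][PATCH_SIZE];

       int    Level;        // refinement level (0 = base level)
       int    Corner[3];    // patch position in the global grid
       Patch *Father;       // parent patch at Level-1 (NULL at level 0)
       Patch *Son[8];       // eight child patches at Level+1 --> oct-tree
       bool   Flag;         // flagged for further refinement?
    };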
Example: Blast Wave Test
[Figure: AMR patches of the blast-wave grid distributed among GPU 1, GPU 2, GPU 3, ...; within one GPU, individual patches map onto Multiprocessor (1) through Multiprocessor (16)]
Example: Blast Wave Test
[Figure: zoom-in of the same mapping; Patch 1 is assigned to Multiprocessor 1, and each cell of the patch is computed by one thread (Thread 1, Thread 2, Thread 3, Thread 4, ...)]
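In CUDA terms this mapping is one thread block per patch and one thread per cell; a minimal sketch with a placeholder update (the kernel name and body are illustrative, not GAMER's solver):

    // One thread block <-> one patch; one thread <-> one cell.
    __global__ void UpdatePatch( float *Fluid, const float dt )
    {
       const int NCell = 8*8*8;                 // 8^3 cells per patch
       const int Idx   = blockIdx.x*NCell + threadIdx.x;

       Fluid[Idx] += dt*0.0f;                   // placeholder for the real update
    }

    // launch: one block per patch, 8^3 = 512 threads per block
    // UpdatePatch <<< NPatch, 512 >>> ( d_Fluid, dt );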
Optimization
CPU vs. CPU + GPU
Dominant factors: the fluid and gravity solvers
[Bar chart: wall-clock time (s) per operation; fluid solver 356.4 s (CPU) vs. 4.4 s (GPU), 81x; gravity solver 349.7 s vs. 4.6 s, 76x; total 731.3 s vs. 30.4 s, 24x]
Optimization I: Asynchronous Memory Copy
- Data transfer between CPU and GPU takes 27% ~ 34% of the total GPU execution time!
- Use CUDA streams to perform memory copies concurrently with kernel execution, as sketched below
[Bar chart: per-solver breakdown into CPU→GPU copy, GPU kernel, and GPU→CPU copy; with streams the fluid solver total drops from 226 to 173 (23%) and the gravity solver total from 298 to 239 (20%)]
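A minimal sketch of the stream-based overlap, assuming patch groups stored in page-locked (pinned) host buffers h_In/h_Out; SolveGroup, NGroup, GroupSize, GroupBytes, and the launch parameters are illustrative:

    // Alternate two CUDA streams so the copies of one patch group
    // overlap with the kernel working on the other group.
    cudaStream_t Stream[2];
    for (int s=0; s<2; s++)   cudaStreamCreate( &Stream[s] );

    for (int g=0; g<NGroup; g++)
    {
       const int    s   = g % 2;
       const size_t Off = (size_t)g*GroupSize;

       cudaMemcpyAsync( d_In +Off, h_In +Off, GroupBytes, cudaMemcpyHostToDevice, Stream[s] );
       SolveGroup <<< GridDim, BlockDim, 0, Stream[s] >>> ( d_In+Off, d_Out+Off );
       cudaMemcpyAsync( h_Out+Off, d_Out+Off, GroupBytes, cudaMemcpyDeviceToHost, Stream[s] );
    }

    cudaDeviceSynchronize();   // wait for all streams to drain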
Optimization I: Asynchronous Memory Copy
[Bar chart: wall-clock time (s); with asynchronous copies the fluid solver drops to 3.4 s (105x over CPU), the gravity solver to 3.5 s (100x), and the total to 27.9 s (26x)]
Optimization II: OpenMP
- Fully exploit the multi-core CPU computing power: N GPUs + K CPU cores (N ≤ K)
- Without OpenMP only N cores (one per GPU) do the CPU-side work; with OpenMP all K cores share it (see the sketch below)
[Diagram: Core [1] ... Core [N] each driving one GPU vs. Core [1] ... Core [K] cooperating via OpenMP]
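A minimal sketch of the CPU side of this scheme, assuming K OpenMP threads prepare the GPU input buffer (e.g., ghost-zone filling) in parallel before a single thread drives the GPU; PrepareInput, the buffers, and the launch parameters are illustrative:

    #include <omp.h>

    // All K CPU cores prepare patch data concurrently ...
    #pragma omp parallel for schedule( dynamic )
    for (int p=0; p<NPatch; p++)
       PrepareInput( p, h_In );                 // e.g., fill ghost zones of patch p

    // ... then a single thread launches the GPU solver on the prepared buffer
    cudaMemcpy( d_In, h_In, TotalBytes, cudaMemcpyHostToDevice );
    UpdatePatch <<< NPatch, 512 >>> ( d_In, dt );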
Optimization II: OpenMP
[Bar chart: wall-clock time (s); with 4 OpenMP threads the CPU-side operations speed up and the total drops from 27.9 s to 14.9 s (1.87x), i.e., 49x over a single CPU core]
Optimization III: Concurrent Execution between CPU and GPU
- Invoking GPU kernels and transferring data between CPU and GPU are asynchronous on the host, so the CPU can do useful work while the GPU computes (see the sketch below)
[Bar chart: wall-clock time (s); the total drops from 27.9 s to 10.3 s (2.71x), i.e., 71x over a single CPU core]
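A minimal sketch of the overlap: kernel launches and cudaMemcpyAsync return control to the host immediately, so the CPU can prepare the next patch group while the GPU works on the current one (names as in the earlier sketches):

    // GPU starts working on group g ...
    cudaMemcpyAsync( d_In, h_In, GroupBytes, cudaMemcpyHostToDevice, Stream[0] );
    SolveGroup <<< GridDim, BlockDim, 0, Stream[0] >>> ( d_In, d_Out );

    // ... while the CPU concurrently prepares group g+1
    PrepareInput( g+1, h_InNext );

    cudaStreamSynchronize( Stream[0] );   // block only when the GPU result is needed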
Optimization IV: Space-filling Curve for Domain Decomposition
- Rectangular domain decomposition can lead to load imbalance: with AMR, some subdomains contain far more patches than others
[Figure: rectangular decomposition of an AMR grid, with "more load" and "less load" subdomains]
Optimization IV : Space-filling Curve for Domain Decomposition The standard space-filling curve method can be applied to GAMER (not complete yet)
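A minimal sketch of ordering patches along a space-filling curve, using a Morton (Z-order) curve here for brevity rather than whichever curve is finally adopted in GAMER; patches sorted by this key can then be cut into contiguous, roughly equal-work segments, one per MPI process:

    // Interleave the bits of a patch's integer coordinates to get its
    // 1D position along a Morton (Z-order) space-filling curve.
    unsigned long long MortonKey( unsigned x, unsigned y, unsigned z )
    {
       unsigned long long Key = 0;

       for (int b=0; b<21; b++)   // 21 bits per coordinate fit in 63 bits
       {
          Key |= (unsigned long long)( x>>b & 1 ) << (3*b+0);
          Key |= (unsigned long long)( y>>b & 1 ) << (3*b+1);
          Key |= (unsigned long long)( z>>b & 1 ) << (3*b+2);
       }

       return Key;
    }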
Performance
Performance: Single GPU
NERSC Dirac GPU cluster: 1 NVIDIA Tesla C2050 GPU, 1 Intel Xeon E5530 CPU
- With self-gravity (80x speed-up in GPU) and individual time-step
[Chart: incremental gains of 2.25x, 1.38x, and 1.11x from the three optimizations]
- Stream: PCI-E/GPU overlap
- Async: CPU/GPU overlap
- OMP(4): 4 OpenMP threads
- GAMER-optimized vs. 1 CPU core: 84x
- GAMER-optimized vs. 4 CPU cores: 22x
Performance: GPU Cluster
NERSC Dirac GPU cluster: 1-32 NVIDIA Tesla C2050 GPUs, 1-32 Intel Xeon E5530 CPUs
- With self-gravity (80x speed-up in GPU) and individual time-step
- Stream: PCI-E/GPU overlap; Async: CPU/GPU overlap; OMP(4): 4 OpenMP threads
- 32 GPUs vs. 32 CPU cores: 71x
- 32 GPUs vs. 128 CPU cores: 18x (equivalent to 2,304 CPU cores)
- MPI communication: ~11% of the total wall-clock time
Applications
I: Large-scale Structure
- 100 h⁻¹ Mpc comoving box
- Effective resolution: 8,192³ and 32,768³
- Purely baryonic (dark matter to be added)
- Speed-up: ~70x
II: Bosonic Dark Matter
- Schrödinger equation with self-gravity
- Uses GAMER as a GPU+AMR framework
- 10 h⁻¹ Mpc comoving box
- Effective resolution: 32,768³
- Speed-up: ~40x
Conclusion
- GAMER: GPU-accelerated Adaptive-MEsh-Refinement Code for Astrophysics
  - A general-purpose AMR + GPU framework
  - Hybrid MPI/OpenMP/GPU parallelization (multiple CPUs + multiple GPUs)
  - 70x ~ 100x speed-up (1 GPU vs. 1 CPU core)
- Optimizations
  - Asynchronous memory copies
  - Hybrid OpenMP/MPI parallelization
  - Concurrent execution between CPU and GPU
  - Space-filling curve for load balance
- GAMER references:
  (1) Schive, H.-Y., et al. 2010, ApJS, 186, 457
  (2) arXiv:1103.3373