Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics
H. Y. Schive (薛熙于), T. Chiueh (闕志鴻), Y. C. Tsai (蔡御之)
Graduate Institute of Physics, National Taiwan University
Leung Center for Cosmology and Particle Astrophysics (LeCosPA)
Workshop on GPU Supercomputing (1/16/2009)
GPU Applications From the smallest scale (QCD, Quantum Spin System) to the largest scale (Astrophysics & Cosmology)
Outline
Introduction
GraCCA (Graphic-Card Cluster for Astrophysics) system and previous work
AMR Hydrodynamics + Self-Gravity Simulation in GPUs
Conclusion and Future Work
Introduction: GPU vs. CPU
Faster, Faster, Faster!!!
GPU: low clock rate, many processors
  GTX 280: 1.30 GHz, 240 processors (30 multiprocessors, each with 16 KB fast shared memory), ~933 GFLOPS
CPU: high clock rate, few processors
  Intel Core 2 Quad Q9300: 2.5 GHz, quad-core, ~40 GFLOPS
→ 23 times faster
Programming Interface: CUDA (Compute Unified Device Architecture)
GPU: a multithreaded coprocessor to the CPU
Executes thousands of threads in parallel
All threads execute the same kernel
[Diagram: a kernel spawns threads (1) ... (N), which are mapped onto the GPU's processors (1) ... (128)]
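The sketch below illustrates this programming model only, not the talk's actual solver: a single kernel is launched over many thread blocks, and every thread executes the same function on its own array element. The kernel name, array, and launch configuration are illustrative.

```cuda
// Minimal sketch of the CUDA model described above (illustrative, not the
// authors' code): every thread runs the same kernel on its own array element.
#include <cuda_runtime.h>

// hypothetical kernel: each thread scales one cell of a field array
__global__ void ScaleField(float *field, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) field[i] *= factor;
}

int main()
{
    const int N = 1 << 20;                           // about one million cells
    float *d_field;
    cudaMalloc(&d_field, N * sizeof(float));

    // launch thousands of threads in parallel, all executing the same kernel
    ScaleField<<<(N + 255) / 256, 256>>>(d_field, 0.5f, N);
    cudaDeviceSynchronize();

    cudaFree(d_field);
    return 0;
}
```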
GraCCA Graphic-Card Cluster for Astrophysics
Architecture
18 nodes, 36 GPUs
Theoretical performance: 518.4 GFLOPS × 36 = 18.7 TFLOPS
Network: gigabit Ethernet
Hardware in each node:

  Hardware       Model                          Amount
  Graphic Card   NVIDIA GeForce 8800 GTX        2
  Motherboard    Gigabyte GA-M59SLI-S5          1
  CPU            AMD Athlon 64 X2 3800          1
  Power Supply   Thermaltake Toughpower 750 W   1
  RAM            DDR2-667 2 GB                  4
  Hard Disk      Seagate 80 GB SATA II          1
Architecture
[Diagram: 18 nodes connected through a gigabit network switch; each node holds one CPU, 2 GB of DDR2-667 PC memory, a gigabit network card, and two graphic cards (each a G80 GPU with 768 MB of GDDR3 memory) on PCI-Express x16 slots]
Photos of GraCCA Multi-node Single-node
Previous Work: Parallel Direct N-body Simulation
Schive et al. 2008, NewA 13, 418
250x speed-up over a single CPU
For N = 1024k:
  Single GPU: 257 GFLOPS
  32 GPUs: 6.62 TFLOPS
[Plot: speed (GFLOPS) vs. N for Ngpu = 1, 2, 4, 8, 16, 32]
Core Collapse in Globular Cluster
Initial condition: Plummer's model
It took about one month for the N = 64k case
One of only a few groups with the computational capability to simulate core collapse for N = 64k
[Plot: log(core density) vs. scaled N-body time for N = 8K, 16K, 32K, 64K]
AMR Hydrodynamics Simulation in GPUs
PDE in Hydrodynamics
Conservation laws of mass, momentum, and energy:

$$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \vec{v}) = 0$$

$$\frac{\partial (\rho \vec{v})}{\partial t} + \nabla \cdot (\rho \vec{v}\vec{v} + P) = -\rho \nabla \phi$$

$$\frac{\partial E}{\partial t} + \nabla \cdot \left[ \vec{v} (E + P) \right] = -\rho \vec{v} \cdot \nabla \phi$$

ρ: density, v: velocity, P: pressure, φ: potential, E: energy density
Adaptive-Mesh-Refinement
Boring region: flat, empty, low error → coarse mesh
Interesting region: high density, high contrast, high error → fine mesh
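As a concrete illustration of this idea, the sketch below flags a patch for refinement when any of its cells looks "interesting". The density and contrast thresholds are assumptions for illustration; the talk does not specify the actual refinement criterion used.

```cuda
// Hedged sketch of a possible AMR refinement check (the actual criterion used
// by the authors is not given in the talk). A patch is refined if any cell has
// a high density or a high contrast with its neighbors.
#include <math.h>

const int PS = 8;                        // cells per patch side (assumption)

// returns true if the patch should be refined to the next level
bool NeedRefine(const float rho[PS][PS][PS],
                float rho_threshold, float contrast_threshold)
{
    for (int k = 1; k < PS - 1; k++)
    for (int j = 1; j < PS - 1; j++)
    for (int i = 1; i < PS - 1; i++)
    {
        // high density --> interesting region
        if (rho[k][j][i] > rho_threshold) return true;

        // high contrast between neighboring cells --> interesting region
        float contrast = fabsf(rho[k][j][i + 1] - rho[k][j][i - 1]) / rho[k][j][i];
        if (contrast > contrast_threshold) return true;
    }
    return false;                        // boring region --> keep the coarse mesh
}
```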
Example: Sedov-Taylor Blast Wave
Density: spherical shock, compression ratio ~3.5
3 refinement levels (0, 1, 2), effective resolution 128³ → 512³
Sedov-Taylor Blast Wave Density
Basic Scheme
2nd-order TVD scheme for the fluid solver
SOR method for the Poisson solver
Hierarchical oct-tree data structure
Basic unit: patch (fixed number of grids)
[Diagram: nested patches at levels 0, 1, and 2, each drawn as a 2×2-grid patch]
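A minimal sketch of what such a hierarchical patch structure might look like is given below; the field names, the 8³ patch size (taken from the Poisson-solver slide later on), and the pointer layout are assumptions, not the talk's actual data structure.

```cuda
// Hedged sketch of a hierarchical oct-tree patch structure; names and layout
// are illustrative. Each patch stores a fixed number of cells, and parent /
// child / sibling pointers form the oct-tree across refinement levels.
const int PS = 8;                       // cells per patch side (assumption)

struct Patch
{
    float  rho [PS][PS][PS];            // mass density
    float  momx[PS][PS][PS];            // x-momentum density
    float  momy[PS][PS][PS];            // y-momentum density
    float  momz[PS][PS][PS];            // z-momentum density
    float  engy[PS][PS][PS];            // total energy density
    float  pot [PS][PS][PS];            // gravitational potential

    int    level;                       // refinement level (0 = coarsest)
    int    corner[3];                   // cell indices of the patch corner on this level
    Patch *father;                      // parent patch, one level coarser
    Patch *son[8];                      // eight child patches (oct-tree), null if unrefined
    Patch *sibling[6];                  // neighboring patches on the same level
};
```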
GPU Acceleration
Two main tasks in the AMR program:
1. Patch construction: decision making, interpolation, complex data structure, data assignment
   → complicated, but takes little time → CPU
2. 3-D hydrodynamics + Poisson solver
   → straightforward, but time-consuming → GPU, fed with hundreds of patches simultaneously
Parallel Evaluation of Multiple Patches in a Single GPU
[Diagram: patches from levels 0, 1, and 2 are distributed among the GPU's multiprocessors (1) ... (16), so many patches are evaluated concurrently]
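The sketch below shows one way to express this mapping in CUDA: each thread block (scheduled onto one multiprocessor) works on one patch and each thread on one cell, so a single kernel launch evaluates hundreds of patches at once. The per-cell update is only a placeholder, not the talk's TVD fluid solver.

```cuda
// Hedged sketch: one thread block per patch, one thread per cell, so hundreds
// of patches are processed concurrently in a single kernel launch.
#include <cuda_runtime.h>

const int PS = 8;                                   // cells per patch side (assumption)

__global__ void EvolvePatches(float *rho, float dt)
{
    const int patch = blockIdx.x;                   // one block <-> one patch
    const int cell  = (threadIdx.z * PS + threadIdx.y) * PS + threadIdx.x;
    const int idx   = patch * PS * PS * PS + cell;

    __shared__ float s_rho[PS * PS * PS];           // patch data in fast shared memory
    s_rho[cell] = rho[idx];
    __syncthreads();

    // ... a real solver would use dt and fluxes from neighboring cells here ...

    rho[idx] = s_rho[cell];                         // write the updated cell back
}

// Usage: launch one block per patch, e.g. for 200 patches
//   EvolvePatches<<<200, dim3(PS, PS, PS)>>>(d_rho, dt);
```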
Concurrent Execution in CPU and GPU
Preparing data for the GPU fluid solver (data copy, interpolation, ...) is also very time-consuming!!
Hide this preparation time by asynchronous execution in the GPU
[Timeline: while the GPU evaluates patch 1, the CPU prepares patch 2; while the GPU evaluates patch 2, the CPU prepares patch 3; ...]
Concurrent Memory Copy and Kernel Execution
The bandwidth between CPU and GPU over 16x PCI-Express is only ~4 GB/s, just not high enough!!!
Hide this data-transfer time by overlapping memory copies (between CPU and GPU) with kernel execution in the GPU
[Timeline: while the GPU evaluates patch 1, patch 2 is transferred over PCI-Express; while the GPU evaluates patch 2, patch 3 is transferred; ...]
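A minimal sketch of this overlap using CUDA streams follows; it assumes the host buffer is page-locked (cudaMallocHost) and splits the patches into two batches so that one batch is being transferred while the other is being evaluated. Buffer names, the batch count, and the sizes are illustrative.

```cuda
// Hedged sketch of overlapping PCI-Express transfers with kernel execution
// using CUDA streams. h_rho must be page-locked (cudaMallocHost) for the
// asynchronous copies to overlap with the kernels.
#include <cuda_runtime.h>

const int PS        = 8;                          // cells per patch side (assumption)
const int PATCH_VOL = PS * PS * PS;               // cells per patch
const int N_PATCH   = 200;                        // patches per solver call (assumption)

__global__ void EvolvePatches(float *rho, float dt);   // see the earlier sketch

void EvolveAllPatches(float *h_rho, float dt)
{
    cudaStream_t stream[2];
    float       *d_rho[2];
    const size_t half_bytes = (size_t)(N_PATCH / 2) * PATCH_VOL * sizeof(float);

    for (int s = 0; s < 2; s++)
    {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&d_rho[s], half_bytes);
    }

    // while one half of the patches is being copied, the other half is evaluated
    for (int s = 0; s < 2; s++)
    {
        const size_t offset = (size_t)s * (N_PATCH / 2) * PATCH_VOL;

        cudaMemcpyAsync(d_rho[s], h_rho + offset, half_bytes,
                        cudaMemcpyHostToDevice, stream[s]);
        EvolvePatches<<<N_PATCH / 2, dim3(PS, PS, PS), 0, stream[s]>>>(d_rho[s], dt);
        cudaMemcpyAsync(h_rho + offset, d_rho[s], half_bytes,
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    for (int s = 0; s < 2; s++)
    {
        cudaStreamSynchronize(stream[s]);
        cudaStreamDestroy(stream[s]);
        cudaFree(d_rho[s]);
    }
}
```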
Performance (hydrodynamics only)
Single GPU vs. single CPU (64³, 128³, 256³, 512³)
12.3x speed-up
[Plot: speed-up ratio vs. simulation size, 64³ to 512³]
Poisson Solver in GPU
Successive Over-Relaxation (SOR) method
Given the boundary condition, the SOR method iteratively approaches the solution of the Poisson equation
A patch of 8³ grids fits entirely into the GPU's shared memory (16 KB per multiprocessor in the GeForce 8800 GTX)
→ data only need to be transferred between global and shared memory before and after the iteration loop
→ more iterations, higher performance
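The sketch below shows how such an in-shared-memory SOR iteration might look: the patch (plus a ghost-cell boundary) is loaded into shared memory once, relaxed with red-black ordering for many iterations, and written back once. The ghost cells are simply zeroed here as a stand-in for the real boundary values, and the relaxation parameters are illustrative.

```cuda
// Hedged sketch of an SOR Poisson iteration done entirely in shared memory.
// Relaxes nabla^2(phi) = 4*pi*G*rho on one 8^3 patch per thread block; dh2 is
// the squared grid spacing times 4*pi*G. Ghost cells are zeroed here instead
// of being filled with the real boundary values used by the actual solver.
#include <cuda_runtime.h>

const int PS = 8;                                   // interior cells per patch side

__global__ void SOR_Patch(float *pot, const float *rho,
                          float dh2, float omega, int n_iter)
{
    const int patch = blockIdx.x;
    const int i = threadIdx.x, j = threadIdx.y, k = threadIdx.z;
    const int idx = patch * PS * PS * PS + (k * PS + j) * PS + i;

    __shared__ float s_pot[PS + 2][PS + 2][PS + 2]; // interior + ghost boundary cells

    // zero the whole buffer (ghost cells act as a zero boundary in this sketch),
    // then load the interior cells from global memory once
    for (int c = (k * PS + j) * PS + i; c < (PS + 2) * (PS + 2) * (PS + 2); c += PS * PS * PS)
        ((float *)s_pot)[c] = 0.0f;
    __syncthreads();
    s_pot[k + 1][j + 1][i + 1] = pot[idx];
    __syncthreads();

    const float src   = dh2 * rho[idx];             // source term for this cell
    const int   color = (i + j + k) & 1;            // red-black ordering

    for (int iter = 0; iter < n_iter; iter++)
    for (int c = 0; c < 2; c++)                     // relax red cells, then black cells
    {
        if (color == c)
        {
            float residual = s_pot[k    ][j + 1][i + 1] + s_pot[k + 2][j + 1][i + 1]
                           + s_pot[k + 1][j    ][i + 1] + s_pot[k + 1][j + 2][i + 1]
                           + s_pot[k + 1][j + 1][i    ] + s_pot[k + 1][j + 1][i + 2]
                           - 6.0f * s_pot[k + 1][j + 1][i + 1] - src;
            s_pot[k + 1][j + 1][i + 1] += omega / 6.0f * residual;
        }
        __syncthreads();
    }

    pot[idx] = s_pot[k + 1][j + 1][i + 1];          // write the relaxed potential back
}

// Usage (illustrative): SOR_Patch<<<n_patch, dim3(PS, PS, PS)>>>(d_pot, d_rho, dh2, 1.4f, 40);
```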
Performance of the SOR in GPU
Single GPU vs. single CPU
17.5x speed-up for ~40 iterations
[Plot: speed-up ratio vs. number of iterations, 10 to 1000]
Multi-GPUs
Each CPU/GPU pair handles a sub-domain
Data exchanged by MPI
[Diagram: four sub-domains, each assigned to one CPU + GPU pair (CPU 0/GPU 0 ... CPU 3/GPU 3), with boundary data transferred over the gigabit network]
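A hedged sketch of this setup follows: each MPI process drives one GPU and swaps sub-domain boundary data with a neighboring process over the network. The buffer names, the neighbor layout, and the two-GPUs-per-node device assignment are illustrative, not the talk's actual code.

```cuda
// Hedged sketch of the multi-GPU domain decomposition: one MPI rank per GPU,
// boundary layers exchanged between neighboring sub-domains over the network.
#include <mpi.h>
#include <cuda_runtime.h>

// exchange one boundary layer (n_boundary floats) with one neighboring rank
void ExchangeBoundary(float *d_send, float *d_recv,
                      float *h_send, float *h_recv,
                      int n_boundary, int neighbor_rank)
{
    // boundary layer: GPU memory -> host memory
    cudaMemcpy(h_send, d_send, n_boundary * sizeof(float), cudaMemcpyDeviceToHost);

    // swap boundary data with the neighboring sub-domain (gigabit network)
    MPI_Sendrecv(h_send, n_boundary, MPI_FLOAT, neighbor_rank, 0,
                 h_recv, n_boundary, MPI_FLOAT, neighbor_rank, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // received boundary layer: host memory -> GPU memory
    cudaMemcpy(d_recv, h_recv, n_boundary * sizeof(float), cudaMemcpyHostToDevice);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(rank % 2);      // two GPUs per GraCCA node (assumption)

    // ... evolve the local sub-domain on the GPU, then call ExchangeBoundary()
    //     for each neighboring sub-domain before the next time-step ...

    MPI_Finalize();
    return 0;
}
```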
Network Bandwidth
The computation is highly improved, but the communication is NOT!!
Gigabit Ethernet bandwidth: only ~128 MB/s
We must minimize the amount of data to be transferred!!!
[Diagram: possible directions for data transfer between sub-domains]
Performance (multi GPUs)
512³ run: 8 GPUs vs. 8 CPUs: 10.0x speed-up
1024³ run: 8 GPUs vs. 8 CPUs: 9.5x speed-up
[Plot: measured vs. ideal speed-up ratio against the number of GPUs for the 512³ run]
Demo : Kelvin-Helmholtz Instability
Performance in the State-of-the-Art GPU
Performance in the GTX 280 GPU:
  Hydrodynamics solver: 1192 ms → 638 ms
  Poisson solver: 336 ms → 154 ms
The performance is further improved by a factor of ~2
But the speed-up ratio of an upgraded GPU over an upgraded CPU is about the same
Conclusion and Future Work
Parallel GPU-accelerated AMR hydrodynamics program
  1 GPU vs. 1 CPU: 12.3x speed-up
  8 GPUs vs. 8 CPUs: 10.0x speed-up
GPU-accelerated Poisson solver
  17.5x speed-up for ~40 iterations
Future work:
  Complete the Poisson solver
  Dark matter particles
  Load balance
  MHD
  Optimization on the latest GPUs (GTX 280, Tesla S1070)