Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics


1 Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics
  H. Y. Schive (薛熙于), T. Chiueh (闕志鴻), Y. C. Tsai (蔡御之)
  Graduate Institute of Physics, National Taiwan University
  Leung Center for Cosmology and Particle Astrophysics (LeCosPA)
  Workshop on GPU Supercomputing (1/16/2009)

2 GPU Applications From the smallest scale (QCD, Quantum Spin System) to the largest scale (Astrophysics & Cosmology)

3 Outline
  Introduction
  GraCCA (Graphic-Card Cluster for Astrophysics) system and previous work
  AMR Hydrodynamics + Self-Gravity Simulation in GPUs
  Conclusion and Future Work

4 Introduction : GPU vs. CPU
  Faster, Faster, Faster!!!
  GPU : low clock rate, many processors
    GTX GHz, 240 processors
    30 multiprocessors, each with 16 KB of fast shared memory
    ~ 933 GFLOPS
  CPU : high clock rate, few processors
    Intel Core 2 Quad Q GHz, quad-core
    ~ 40 GFLOPS
  GPU peak throughput : ~ 23 times faster

5 Programming interface : CUDA (Compute Unified Device Architecture)
  The GPU acts as a multithreaded coprocessor to the CPU
  It executes thousands of threads in parallel
  All threads execute the same kernel
  [Diagram: one kernel launches Thread (1), Thread (2), ... Thread (N), which run on Processor (1), Processor (2), ... Processor (128) of the GPU]

6 GraCCA Graphic-Card Cluster for Astrophysics

7 Architecture
  18 nodes, 36 GPUs
  Theoretical performance : 518.4 GFLOPS * 36 = 18.7 TFLOPS
  Network : gigabit Ethernet
  Hardware in each node :
    Hardware       Model                        Amount
    Graphic Card   NVIDIA GeForce 8800 GTX      2
    Motherboard    Gigabyte GA-M59SLI S5        1
    CPU            AMD Athlon 64 X
    Power Supply   Thermaltake Toughpower 750W  1
    RAM            DDR GB RAM                   4
    Hard Disk      Seagate 80G SATAII           1
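The theoretical figure on this slide is the per-card peak multiplied by the number of cards. A quick check of that arithmetic, assuming 518.4 is the single-card peak in GFLOPS (the slide gives no unit; the function name is only illustrative):

```python
# Aggregate theoretical peak of the cluster, from the numbers on the slide.
NODES = 18
GPUS_PER_NODE = 2
PEAK_PER_GPU_GFLOPS = 518.4  # assumed unit: GFLOPS per GeForce 8800 GTX


def cluster_peak_tflops(nodes, gpus_per_node, peak_gflops):
    """Return the aggregate theoretical peak in TFLOPS."""
    return nodes * gpus_per_node * peak_gflops / 1000.0


peak = cluster_peak_tflops(NODES, GPUS_PER_NODE, PEAK_PER_GPU_GFLOPS)
print(f"{NODES * GPUS_PER_NODE} GPUs, {peak:.1f} TFLOPS")  # 36 GPUs, 18.7 TFLOPS
```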

8 Architecture
  [Diagram: Node 1 to Node 18 each connect through a gigabit network card to a gigabit network switch. Inside a node, the CPU and PC memory (DDR2-667, 2 GB) connect over two PCI-Express x16 links to Graphic Card 1 and Graphic Card 2, each a G80 GPU with 768 MB of GDDR3 GPU memory.]

9 Photos of GraCCA
  [Photos: multi-node view, single-node view]

10 Previous Work : Parallel Direct N-body Simulation
  Schive et al., NewA 13, 418
  For N = 1024k :
    Single GPU : 257 GFLOPS
    32 GPUs : 6.62 TFLOPS
  [Figure: speed (GFLOPS) versus N on log-log axes, with curves for Ngpu = 1, 2, 4, 8, 16, 32, annotated with the speed-up over a single CPU]

11 Core Collapse in Globular Cluster
  Initial condition : Plummer's model
  The N = 64k case took about one month
  One of only a few groups with the computational capability to simulate core collapse for N = 64k
  [Figure: log(core density) versus scaled N-body time, for N = 8K, 16K, 32K, 64K]

12 AMR Hydrodynamics Simulation in GPUs

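13 PDE in Hydrodynamics
  Conservation laws of mass, momentum, and energy :
  $$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{v}) = 0$$
  $$\frac{\partial (\rho \mathbf{v})}{\partial t} + \nabla \cdot (\rho \mathbf{v}\mathbf{v} + P) = -\rho \nabla \phi$$
  $$\frac{\partial E}{\partial t} + \nabla \cdot (\mathbf{v}(E + P)) = -\rho \mathbf{v} \cdot \nabla \phi$$
  $\rho$ : density, $\mathbf{v}$ : velocity, $P$ : pressure, $\phi$ : potential, $E$ : energy density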

14 Adaptive-Mesh-Refinement
  Boring region : flat, empty, low error -> coarse mesh
  Interesting region : high density, high contrast, high error -> fine mesh
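The split into boring and interesting regions can be sketched as a per-cell flagging pass. The slides do not state the actual refinement criterion, so the relative density jump and the threshold used here are assumptions, and `flag_for_refinement` is a hypothetical name:

```python
def flag_for_refinement(density, threshold=0.5):
    """Return indices of 1-D cells whose relative density jump to a
    neighbour exceeds `threshold` (assumed criterion; density must be > 0)."""
    flagged = []
    last = len(density) - 1
    for i, rho in enumerate(density):
        left = density[i - 1] if i > 0 else rho
        right = density[i + 1] if i < last else rho
        jump = max(abs(rho - left), abs(right - rho))
        if jump / rho > threshold:
            flagged.append(i)  # high contrast: use a fine mesh here
    return flagged


# A flat background with one dense clump in the middle.
density = [1.0, 1.0, 1.0, 4.0, 4.0, 1.0, 1.0, 1.0]
flagged = flag_for_refinement(density)
print(flagged)  # [2, 3, 4, 5]
```

Only the cells around the two density jumps are flagged; the flat cells on either side stay on the coarse mesh.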

15 Example : Sedov-Taylor Blast Wave
  [Figure: density of the spherical shock]
  compression ratio ~
  refine levels ( )

16 Sedov-Taylor Blast Wave Density

17 Basic Scheme
  2nd-order TVD scheme for the fluid solver
  SOR method for the Poisson solver
  Hierarchical oct-tree data structure
  Basic unit : patch (fixed number of grids)
  [Diagram: patches in level 0, level 1, and level 2, each drawn with 2*2 grids]
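A minimal sketch of a patch-based oct-tree, in which every patch holds the same number of grids and refining a patch creates eight children at half the cell size. The class layout, field names, and the choice of 8 grids per side (the patch size quoted later for the Poisson solver) are illustrative assumptions, not the authors' data structure:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

PATCH_SIZE = 8  # grids per side in every patch (assumed, fixed on all levels)


@dataclass
class Patch:
    level: int
    corner: Tuple[float, float, float]  # lower corner in domain units
    width: float                        # physical edge length of the patch
    parent: Optional["Patch"] = field(default=None, repr=False, compare=False)
    children: List["Patch"] = field(default_factory=list, repr=False)

    @property
    def cell_size(self) -> float:
        return self.width / PATCH_SIZE

    def refine(self) -> List["Patch"]:
        """Split this patch into 8 children, one per octant."""
        if self.children:
            return self.children
        half = self.width / 2.0
        x0, y0, z0 = self.corner
        for dz in (0, 1):
            for dy in (0, 1):
                for dx in (0, 1):
                    corner = (x0 + dx * half, y0 + dy * half, z0 + dz * half)
                    self.children.append(
                        Patch(self.level + 1, corner, half, parent=self))
        return self.children


def count_patches(patch: Patch) -> int:
    """Count a patch and all of its descendants."""
    return 1 + sum(count_patches(child) for child in patch.children)


root = Patch(level=0, corner=(0.0, 0.0, 0.0), width=1.0)
kids = root.refine()   # level 1: 8 patches
kids[0].refine()       # level 2: refine only one octant
print(count_patches(root), root.cell_size, kids[0].cell_size)
```

Because the grid count per patch is fixed, each extra level halves the cell size while keeping the per-patch workload identical, which is what lets many patches be handed to the solver as uniform units.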

18 GPU Acceleration
  Two main tasks in the AMR program :
  1. Patch construction : decision making, interpolation, complex data structure, data assignment ~ complicated, but takes little time -> CPU
  2. 3-D hydrodynamics + Poisson solver ~ straightforward, but time-consuming -> GPU, fed with hundreds of patches simultaneously

19 Parallel Evaluation of Multiple Patches in a Single GPU
  [Diagram: patches distributed over Multiprocessor (1), Multiprocessor (2), Multiprocessor (3), ... Multiprocessor (16) of the GPU]

20 Concurrent Execution in CPU and GPU
  Preparing data for the GPU fluid solver (data copy, interpolation) is also very time-consuming!!
  Hide this preparation time with asynchronous execution on the GPU
  [Timeline: while the GPU evaluates patch 1, the CPU prepares patch 2; while the GPU evaluates patch 2, the CPU prepares patch 3]
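The overlap on this slide can be sketched with a one-worker thread pool standing in for the GPU. This is only an analogy for asynchronous kernel launches, not the authors' CUDA code; `prepare` and `evaluate` are hypothetical placeholders for the CPU-side preparation and the GPU solver:

```python
from concurrent.futures import ThreadPoolExecutor


def prepare(patch_id):
    """Stand-in for CPU-side work (data copy, interpolation)."""
    return [float(patch_id)] * 4


def evaluate(data):
    """Stand-in for the GPU solver; here it only sums the buffer."""
    return sum(data)


def run_pipeline(patch_ids):
    """Prepare patch n on the caller's thread while patch n-1 is evaluated."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as gpu:  # one worker plays the GPU
        pending = None
        for patch_id in patch_ids:
            data = prepare(patch_id)          # overlaps with the pending evaluate
            if pending is not None:
                results.append(pending.result())
            pending = gpu.submit(evaluate, data)  # returns immediately
        if pending is not None:
            results.append(pending.result())
    return results


print(run_pipeline([1, 2, 3]))
```

The same double-buffering pattern applies when the overlapped step is a memory transfer rather than CPU-side preparation.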

21 Concurrent Memory Copy and Kernel Execution
  The bandwidth between CPU and GPU is only 4 GB/s, which is not high enough!!!
  Hide this data-transfer time by overlapping memory copies (between CPU and GPU) with kernel execution on the GPU
  [Timeline: while the GPU evaluates patch 1, the 16x PCI-E link transfers patch 2; while the GPU evaluates patch 2, it transfers patch 3]

22 Performance (hydrodynamics only)
  Single GPU vs. single CPU (64^3, 128^3, 256^3, )
  [Figure: speed-up ratio versus simulation size, annotated with the measured speed-up]

23 Poisson Solver in GPU
  Successive Over-Relaxation (SOR) method
  Given the boundary condition, the SOR method iteratively approaches the solution of the Poisson equation
  A patch with 8^3 grids fits entirely in the shared memory of the GPU (16 KB per multiprocessor in the GeForce 8800 GTX)
  Data need to be transferred between global memory and shared memory only before and after the iteration loop
  More iterations, higher performance
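A minimal CPU-side sketch of the SOR iteration on one patch, assuming a 7-point Laplacian, unit grid spacing, a relaxation factor of 1.5, and one layer of fixed boundary cells around the 8^3 interior. None of these parameters are stated on the slide, and the slides' GPU kernel is not reproduced here:

```python
def sor_poisson(phi, source, h=1.0, omega=1.5, iterations=100):
    """Run in-place SOR sweeps for laplacian(phi) = source.

    phi and source are cubes of nested lists indexed [i][j][k]. The outer
    layer of phi holds the fixed boundary values and is never updated.
    For self-gravity the source term would be 4*pi*G*rho.
    """
    n = len(phi)
    for _ in range(iterations):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                for k in range(1, n - 1):
                    neighbours = (phi[i - 1][j][k] + phi[i + 1][j][k] +
                                  phi[i][j - 1][k] + phi[i][j + 1][k] +
                                  phi[i][j][k - 1] + phi[i][j][k + 1])
                    gauss_seidel = (neighbours - h * h * source[i][j][k]) / 6.0
                    # Over-relax: move past the Gauss-Seidel value by omega.
                    phi[i][j][k] += omega * (gauss_seidel - phi[i][j][k])
    return phi


# Test problem with a known answer: phi = x^2 satisfies laplacian(phi) = 2,
# and the 7-point stencil is exact for quadratics.
n = 10  # 8 interior cells per side plus one boundary layer on each face
phi = [[[float(i * i) if 0 in (i, j, k) or n - 1 in (i, j, k) else 0.0
         for k in range(n)] for j in range(n)] for i in range(n)]
source = [[[2.0] * n for _ in range(n)] for _ in range(n)]

sor_poisson(phi, source)
max_error = max(abs(phi[i][j][k] - i * i)
                for i in range(n) for j in range(n) for k in range(n))
print(max_error)
```

Assuming single precision (the slide does not say), the 8^3 interior values occupy 8 * 8 * 8 * 4 = 2048 bytes, well under the 16 KB of shared memory, so the whole iteration loop can run without touching global memory.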

24 Performance of the SOR in GPU
  Single GPU vs. single CPU
  17.5x speed-up for ~ 40 iterations
  [Figure: speed-up ratio versus number of iterations]

25 Multi-GPUs
  Each CPU and GPU pair handles a sub-domain
  Data are exchanged by MPI
  [Diagram: CPU 0 / GPU 0, CPU 1 / GPU 1, CPU 2 / GPU 2, and CPU 3 / GPU 3, linked by data transfer over the gigabit network]
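How the domain is divided is not specified on the slide. As one possible sketch, a slab decomposition along a single axis assigns each rank a contiguous range of cells; `slab_bounds` is a hypothetical helper, and the MPI exchange itself is left out:

```python
def slab_bounds(n_cells, n_ranks, rank):
    """Half-open [start, stop) range of cells owned by `rank`.

    Any remainder is spread over the first ranks so that slab sizes
    differ by at most one cell.
    """
    base, extra = divmod(n_cells, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Four CPU/GPU pairs sharing 512 cells along one axis.
slabs = [slab_bounds(512, 4, rank) for rank in range(4)]
print(slabs)  # [(0, 128), (128, 256), (256, 384), (384, 512)]
```

Only the cells next to a slab boundary would need to be sent to the neighbouring rank, which keeps the volume of exchanged data small relative to the work done per sub-domain.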

26 Network Bandwidth
  The computation is greatly accelerated, but the communication is NOT!!
  Gigabit Ethernet bandwidth ~ only 128 MB/s
  We must minimize the amount of data to be transferred!!!
  [Diagram: possible directions for data transfer]

27 Performance (multi GPUs)
  512^3 run : 8 GPUs vs. 8 CPUs : 10.0x speed-up
  1024^3 run : 8 GPUs vs. 8 CPUs : 9.5x speed-up
  [Figure: speed-up ratio versus number of GPUs, measured and ideal curves]

28 Demo : Kelvin-Helmholtz Instability

29 Performance in the state-of-the-art GPU
  Performance in the GTX 280 GPU
  Hydrodynamics solver : 1192 ms -> 638 ms
  Poisson solver : 336 ms -> 154 ms
  The performance is further improved by a factor of about 2
  But the speed-up ratio of an upgraded GPU over an upgraded CPU is about the same
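The factor of 2 can be checked directly from the timings, assuming the first number of each pair is the older GPU and the second is the GTX 280 (the slide does not label them):

```python
# (before, after) timings in milliseconds, as listed on the slide.
timings_ms = {
    "hydrodynamics": (1192, 638),
    "poisson": (336, 154),
}

# Improvement factor = old time / new time.
speedups = {name: old / new for name, (old, new) in timings_ms.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

The two solvers improve by roughly 1.9x and 2.2x, consistent with the rounded factor quoted on the slide.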

30 Conclusion and Future Work
  Parallel GPU-accelerated AMR hydrodynamics program
    1 GPU vs. 1 CPU : 12.3x speed-up
    8 GPUs vs. 8 CPUs : 10.0x speed-up
  GPU-accelerated Poisson solver
    17.5x speed-up for 40 iterations
  Future work
    Complete the Poisson solver
    Dark matter particles
    Load balance
    MHD
    Optimization in the latest GPU (GTX 280, Tesla S1070)
