An Innovative Massively Parallelized Molecular Dynamic Software

1 An Innovative Massively Parallelized Molecular Dynamic Software
Mohamed Hacene, Ani Anciaux, Xavier Rozanska, Paul Fleurat-Lessard, Thomas Guignon
The CADENCED Project, DTIMA
VASP on GPU
GTC /05/2012
Renewable energies, Eco-friendly production, Innovative transport, Eco-efficient processes, Sustainable resources

2 CADENCED Project
SP3 CADENCED Project
- Joint project with KAUST, CNRS, ENS Lyon and IFPEN
- Goal: design new catalysts, with a focus on hydrogen production
- Sub-project 3: improve simulation tools to help new catalyst design
  - Explore GPU computing for MD simulation
  - Develop tools for MD code coupling (VASP + TurboMole)

3 VASP
- Developed by the University of Vienna
- Package for performing ab initio quantum-mechanical molecular dynamics (MD) using pseudopotentials and a plane-wave basis set
- Implementation approach: based on a finite-temperature local-density approximation (with the free energy as the variational quantity) and an exact evaluation of the instantaneous electronic ground state at each optimization step
- Target: a high-performance hybrid (CPU+GPU) version of VASP 5.2

4 GPU methodology (1)
- Formal view of VASP:
  - Don't care about the physical models
  - Care about the numerical algorithms (linear algebra, FFT, ...), work flow and data flow
- Profile VASP to identify the most time-consuming functions
- Move the most expensive functions to the GPU
- Analyze transfers between the CPU and the GPU

5 GPU methodology (2)
- The VASP profile shows that the majority of the time is spent in:
  - FFT
  - BLAS
- Time-consuming functions:
  - EDDAV for the blocked Davidson method (IALGO=38)
  - EDDIAG, RMMDIIS, ORTHCH for the RMM-DIIS method (IALGO=48)
  - POTLOK and CHARGE functions for both methods

6 GPU methodology (3)
Step-by-step approach:
- Inside a function, identify the computation parts that can move to the GPU.
- Rewrite them for the GPU (or use a library): FFT -> CUFFT, BLAS -> CUBLAS, specific computation loops -> hand-coded kernels.
- Introduce them with a data transfer before and after each kernel call.
- Check GPU vs CPU results: easy validity check against the CPU version.
- Analyze the data flow to:
  - Remove unnecessary copies.
  - Find asynchronous transfer opportunities.
[Figure: timelines (time axis t) showing CPU BLAS and FFT work replaced step by step by GPU CUBLAS and CUFFT calls]
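The "check GPU vs CPU results" step can be sketched as an element-wise comparison with a tolerance. This is only an illustrative Python sketch, not the project's code: the function name `results_match` and the tolerance values are hypothetical, and plain lists stand in for the arrays returned by the CPU routine and copied back from the GPU.

```python
import math

def results_match(cpu_values, gpu_values, rel_tol=1e-10, abs_tol=1e-12):
    """Compare two result vectors; return (ok, worst_index, worst_diff)."""
    if len(cpu_values) != len(gpu_values):
        raise ValueError("result vectors differ in length")
    ok = True
    worst_index, worst_diff = -1, 0.0
    for i, (c, g) in enumerate(zip(cpu_values, gpu_values)):
        diff = abs(c - g)
        # Track the largest absolute deviation to help locate a faulty kernel.
        if diff > worst_diff:
            worst_index, worst_diff = i, diff
        if not math.isclose(c, g, rel_tol=rel_tol, abs_tol=abs_tol):
            ok = False
    return ok, worst_index, worst_diff
```

Running such a check after each function is ported keeps a faulty kernel from being hidden behind later steps. The tolerances would have to be chosen per function, since the GPU libraries may order floating-point operations differently from the CPU ones.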

7 CPU/GPU Automatic choice
- CPU computation between 2 GPU calls (HAMILTMU, PROJALL):
  - Algorithm not GPU friendly
  - Data set too small (low parallelism)
- GPU <-> CPU data transfers reduce the GPU gains.
- Moving the CPU computation to the GPU means no data transfers.
- No performance model: do we reduce the computation time?
[Figure: timeline (axis t) with GPU kernel 1 and GPU kernel 2 separated by a CPU computation and the GPU/CPU data transfers around it]

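8 CPU/GPU Automatic choice
- First iteration: run the computation on the CPU and record the time T_cpu.
- Second iteration: run it on the GPU and record the time T_gpu.
- Following iterations: take the fastest one.
- The GPU computation can be longer, but we avoid the copy.
[Figure: two timelines (axis t), one with the computation on the CPU between GPU kernel 1 and GPU kernel 2 (T_cpu), one with it on the GPU (T_gpu)]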
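The measure-then-choose rule above can be sketched in a few lines. This is a hedged Python illustration, not the VASP implementation: the class name `AutoChoice` and its interface are hypothetical, and each callable is assumed to include whatever data transfers its own path requires, so that the two timings are comparable.

```python
import time

class AutoChoice:
    """Time the CPU path once, the GPU path once, then keep the faster one."""

    def __init__(self, cpu_impl, gpu_impl, clock=time.perf_counter):
        self.impls = {"cpu": cpu_impl, "gpu": gpu_impl}
        self.timings = {}  # filled during the first two calls
        self.clock = clock

    def _next_target(self):
        if "cpu" not in self.timings:
            return "cpu"  # first iteration: take the CPU time
        if "gpu" not in self.timings:
            return "gpu"  # second iteration: take the GPU time
        return min(self.timings, key=self.timings.get)  # then the fastest

    def __call__(self, *args):
        target = self._next_target()
        if target in self.timings:
            return self.impls[target](*args), target
        start = self.clock()
        result = self.impls[target](*args)
        self.timings[target] = self.clock() - start
        return result, target
```

A single measurement per path is cheap but noisy. The sketch keeps that simplicity because the slide describes exactly one timed iteration for each path.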

9 Results: Test systems
- Initially developed on a 2x E5420 system with an S1070 GPU
- Test systems:
  - 1 workstation: 4c Core2 @ 2.66 GHz (Q9450) + C2070
  - 1 bullx system (9 nodes): 2x 4c @ 2.5 GHz (E5540) + 2x M1060
- Bullx node:
  - Each GPU has its own PCI Express bus.
  - Take care of CPU/GPU affinity.
[Figure: bullx node diagram with two M1060 GPUs]

10 Acceleration, 1 core/1 GPU
- Test cases (algo=fast):
  - SILICA, 240 atoms
  - SLAB, 328 atoms
- Acceleration: 3.8 to 5.0 vs 1 core of a Nehalem at 2.5 GHz
- C2070: no significant gain over M1060
  - Slow host processor
  - Slow PCI Express
[Figure: total time comparison between Xeon E5540, E5540+M1060 and Q9450+C2070, CUDA 3.2; acceleration factors are given in brackets]

11 Acceleration: iteration details
- SLAB, 1 iteration (ialgo=48), acceleration vs E5540:
  - EDDIAG: M1060: [...], C2070: 12.1
  - ORTHCH: M1060: [...], C2070: 14.2
  - RMMDIIS: M1060: 2.3, C2070: 2.54
- Overall acceleration is limited by the RMMDIIS function.
[Figure: time per function (EDDIAG, CHARGE, ORTHCH, RMMDIIS) for Xeon E5540, Tesla M1060 and Tesla C2070]

12 GPU / core balance
- Comparing 1 GPU with 1 core is not really fair:
  - Typical balance is 4 cores for one GPU.
  - Assuming linear acceleration for the CPU version, one GPU does not improve performance compared to 4 cores.
- Denser GPU systems (4 cores, 4 GPUs)? Problems:
  - PCI Express scalability
  - Power supply, thermal dissipation
- Multi-core VASP with MPI
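The balance argument is simple arithmetic, sketched here in Python. The function name `gain_vs_cores` and the `parallel_efficiency` parameter are hypothetical. The inputs are the 3.8 to 5.0 accelerations reported against one core, and the linear CPU scaling is the assumption stated on the slide.

```python
def gain_vs_cores(speedup_vs_one_core, cores, parallel_efficiency=1.0):
    """GPU speedup relative to `cores` CPU cores.

    parallel_efficiency=1.0 means the CPU version scales linearly.
    """
    return speedup_vs_one_core / (cores * parallel_efficiency)

for s in (3.8, 5.0):
    print(f"{s:.1f}x vs 1 core -> {gain_vs_cores(s, 4):.2f}x vs 4 cores")
```

Under linear scaling, the gain against 4 cores falls between 0.95 and 1.25, which is why one GPU barely improves on a 4-core CPU. A lower `parallel_efficiency` for the CPU version would shift the comparison in favour of the GPU.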

13 Multiple CPUs/GPUs: results (1)
- SLAB: 8 CPUs + 8 GPUs are faster than 32 CPUs.
- GPU acc. [...] when MPI processes [...]
- For 16 CPUs/16 GPUs acc. is only [...]
[Figure: VASP, SLAB (328 atoms), multi-GPU (Bullx), CUDA; CPU E5540 (B505) vs GPU M1060 (B505) for 4, 8, 16 and 32 CPUs]

14 Multiple CPUs/GPUs: results (2)
- WGPS3, 1138 atoms, Algo=Fast (SP1 test case)
- 8 CPUs + 8 GPUs are faster than 32 CPUs.
- For 16 CPUs/16 GPUs acc. is only [...]
[Figure: WGPS3 (1138 atoms) test on VASP multi-GPU Bullx, CUDA 3; CPU E5540 (B505) vs GPU M1060 (B505), axis labels include 8, 16 and 32 CPUs]

15 Beyond initial results
- Updated system: 2.8 GHz (WS3530) + C2070
- SLAB test case (NSW=1):
  - CPU time, 1 core: [...] s
  - GPU time: 4064 s
- Acceleration is only 4.33; it was 5 previously!
- The hard point is still the RMMDIIS function.

16 Closer look at RMMDIIS
RMMDIIS function: 92.4 s, 74% of iteration time

Part        Time (s)    %
PROJALL     2.30E+01    24.8
Init        4.40E+00     4.8
Hamil       2.93E+01    31.8
ECCP        5.35E+00     5.8
FFTWAV      1.07E+01    11.6
BLAS        1.19E+01    12.9
LAPACK      7.48E-02     0.1
Transfer    7.66E+00     8.3

Only on CPU (see slide 9)
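The breakdown can be checked, and the payoff of speeding up individual parts estimated, with a short Python sketch. The times are the ones listed for RMMDIIS; the helper names `shares` and `amdahl_speedup` are hypothetical, and the local speedup factor used in the usage note is an illustrative assumption, not a measured value.

```python
rmmdiis_parts = {  # seconds, as listed for the RMMDIIS function
    "PROJALL": 2.30e+01, "Init": 4.40e+00, "Hamil": 2.93e+01,
    "ECCP": 5.35e+00, "FFTWAV": 1.07e+01, "BLAS": 1.19e+01,
    "LAPACK": 7.48e-02, "Transfer": 7.66e+00,
}

def shares(parts):
    """Return the total time and each part's share of it in percent."""
    total = sum(parts.values())
    return total, {name: 100.0 * t / total for name, t in parts.items()}

def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of the time runs `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

total, percent = shares(rmmdiis_parts)
fraction = (rmmdiis_parts["PROJALL"] + rmmdiis_parts["Hamil"]) / total
```

The parts sum to about 92.4 s, and the recomputed shares agree with the listed percentages to within about 0.1 point. PROJALL and Hamil together account for roughly 57% of RMMDIIS, so under the hypothetical assumption that both ran twice as fast, `amdahl_speedup(fraction, 2.0)` gives only about 1.4x for the whole function.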

17 Performance problem (PROJALL)
- Low parallelism: number of grid points per ion (on the order of 1000 for SLAB)
- A solution: use parallelism over ions with CUDA streams for PROJALL (RPROMU)
- RMMDIIS time goes down to 62.8 s (from 92.4 s).
- Overall simulation time: 3126 s (was 4064 s).
- Acceleration is now [...]

18 Performance problem (HAMIL)
- Similar to PROJALL, but the real-space grid is updated for each ion: no simple parallelism when ions overlap.
- Possible solutions:
  - Atomic operations: do not work with double precision, may be inefficient.
  - Finding independent sets of ions: no two ions in the same set overlap.
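The "independent sets of ions" idea can be illustrated with a greedy partition. This Python sketch rests on simplifying assumptions that are not in the slides: each ion is modelled as a sphere `(x, y, z, radius)` covering the grid points it updates, periodic boundary conditions are ignored, and the names `overlaps` and `independent_sets` are hypothetical.

```python
import math

def overlaps(a, b):
    """Two ions overlap when their spheres (x, y, z, radius) intersect."""
    return math.dist(a[:3], b[:3]) < a[3] + b[3]

def independent_sets(ions):
    """Greedy partition: put each ion into the first set it does not overlap."""
    sets = []  # each entry is a list of ion indices
    for i, ion in enumerate(ions):
        for group in sets:
            if not any(overlaps(ion, ions[j]) for j in group):
                group.append(i)
                break
        else:
            sets.append([i])  # the ion conflicts with every existing set
    return sets

ions = [(0, 0, 0, 1), (1.5, 0, 0, 1), (3, 0, 0, 1), (10, 0, 0, 1)]
print(independent_sets(ions))  # [[0, 2, 3], [1]]
```

All ions of one set could then update the grid concurrently without write conflicts, with the sets processed one after another. The greedy pass does not minimise the number of sets; it only guarantees that no two ions in a set overlap.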

19 Conclusions and Future work (1)
- Best-effort approach for VASP on GPU:
  - >10 acceleration factor on some functions
  - RMMDIIS needs more improvement.
- Best-effort approach for multicore VASP (OpenMP, Pthreads)?
  - Can the GPU compete with multicore?
- Possible solution: multicore with GPU
  - In a node, balance the work between all cores and all GPUs.

20 Conclusions and Future work (2)
- Benefit from CUDA 4.0: direct data transfer from GPU memory to the InfiniBand network
- Mixed precision? For some parts, maybe single precision is enough.
- New GPU cards: M[...] [...]% faster than M2070
- Don't use ECC memory
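Whether single precision is "enough" for a given part can be estimated before porting it by emulating single-precision rounding on the CPU. This Python sketch uses only the standard library; the helper names are hypothetical, and the dot product is just a stand-in for whichever kernel would be evaluated.

```python
import struct

def to_single(x):
    """Round a Python float (double precision) to the nearest IEEE single."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot_single(a, b):
    """Dot product with every product and partial sum rounded to single."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_single(acc + to_single(to_single(x) * to_single(y)))
    return acc

def relative_error(approx, exact):
    """Relative deviation of `approx` from a nonzero reference value."""
    return abs(approx - exact) / abs(exact)

a = [0.1] * 1000
b = [1.0] * 1000
error = relative_error(dot_single(a, b), 100.0)
```

Comparing `error` with the accuracy the surrounding algorithm needs gives a first indication of which parts could drop to single precision. The actual decision would still need validation against the double-precision results of the full code.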

21 Question?
