VASP Accelerated with GPUs


1 VASP Accelerated with GPUs Capabilities, Methods, and Road-Map Max Hutchinson University of Chicago; Carnegie Mellon University GTC, May 17th, 2012 Max Hutchinson (UChicago and CMU) GPU VASP GTC 5/17/12 1 / 44

2 Acknowledgements
The rest of our team: Michael Widom, James Komianos
The real VASP team: Georg Kresse, Martijn Marsman, Jürgen Hafner
This work was supported by the PETTT project PP-CCM-KY P3. This research was supported in part by the National Science Foundation through TeraGrid resources provided by Pittsburgh Supercomputing Center.

4 References
M. Hutchinson, M. Widom, "VASP on a GPU: Application to exact-exchange calculations of the stability of elemental boron", Computer Physics Communications, Volume 183, Issue 7, July 2012.

5 Table of Contents
1. Context: Motivating Science; DFT and VASP
2. Capabilities and Performance: Low-Level Ports; High-Level Ports; System Capabilities, Requirements
3. Design Decisions and Methods: Guiding Principles; Development Cycle; Examples; Tips
4. Road-Map: Our plans; Your part

6 Section 1: Context

7 Motivating Science: Quantum Chemistry, Hard Condensed Matter
Modern model for atomic physics has non-classical elements
  Electron correlation, exchange energy
  Discretization of energy, angular momentum
Practical understanding of some materials requires quantum models
  Nano-scale electronics
  Surface effects
  High-resolution spectroscopy
  Low-temperature structure

8 Scientific Perspective
Start by approximating the n-body quantum system with the single-particle Kohn-Sham equation.
Density functional theory (DFT) approximates correlation and exchange energies as functionals of the electron density.
Functionals form a ladder of increasing accuracy and computational cost.
Eigenvalue solvers are then used to find the wave-functions.

9 One example: Boron
The low-temperature structure of elemental boron is not known.

Functional  E_βα   E_β'α
LDA         47.83  15.48
PBE         26.63  -0.17
PKZB        37.02   8.53
HF          46.74   8.

Table: Structural energies (units meV/atom). Here β refers to the ideal hR105 structure, β' refers to the 107 atom optimized variant of B.hR141. Energies of α are obtained from the super cell hR12x8. All values are given for the 3x3x3 k-point mesh.

10 Computational Perspective
DFT is nominally O(n^2 ln n) or O(n^3), depending on system size.
Exact-exchange is more expensive: O(n^3 ln n) or O(n^4).
Operations have high fine-grain data parallelism
  BLAS
  FFT
  Scatter-Gather
Iterations are long (on the order of a second)
All adds up to a great GPU candidate

11 Section 2: Capabilities and Performance

12 FFT Port
FFTs contribute 30-50% of CPU time.
FFT calls are funneled through kernels (4 of them)
  Previously used to switch between FFTW and custom FFTs
Simple copy, compute, copy-back used
Table: PdO benchmark (87 ions, 496 bands, 822 electrons) on Dirac (NERSC). Columns: Cores, CPU, + 1 GPU, Ratio (cell values not preserved in this transcript).

13 BLAS Port
BLAS calls contribute 15-40% of CPU time.
BLAS calls are made inline, but there aren't too many important ones
Again, simple copy, compute, copy-back used
Performance was poor (20% worse), so this was abandoned early on.
  Advances in CUBLAS might make this profitable

14 Exact-Exchange (HF) Port
Hybrid functionals, or exact-exchange, are very intensive
  > 98% of runtime
  Factor of 2 in memory use
Includes interaction between bands
  Adds a linear order to previous complexities
VASP implementation is somewhat compartmentalized
  Calls funnel through two routines
  Once per k-point per iteration

15 HF Port Performance: Workstation vs Workstation

Structure  hR12   hR12x8  hR105
Speedup    5.82x  12.39x  20.41x

Table: Run-times of components of VASP exact-exchange runs. Overall times are projected assuming a total of 5 ionic minimization steps and 75 electronic minimization steps. CPU runs are single-core and GPU runs are single-device. The original table also has rows FOCK_ACC (s), FOCK_FORCE (s), Other (s) and Overall (hr), each with cpu and gpu columns per structure; those values are not preserved in this transcript.

16 Plots

17 HF Port Performance: Workstation vs Supercomputer
Columns: Struct., k, T-1C1G, T-2C2G, B-16C, B-32C, B-64C, B-128C (cell values not preserved in this transcript)
Table: Actual run-times of truncated runs (reduced NELM and NSW) of different structures on different platforms. T is tirith, B is blacklight; the attribute mCnG indicates m CPU cores and n GPU devices.

18 Other Capabilities
Compute capability 2.0 or higher
Arbitrary CPU:GPU ratios
  Round-robin
  Uses File I/O (I'm sorry)
Mixed or full double precision
  FFTs in single or double
  Everything else in double
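The "Round-robin" item can be illustrated with a minimal sketch. The slide does not show how CPU processes are mapped to devices; the modulo mapping and the names `device_for_rank` and `assign_devices` below are assumptions for illustration, not the port's actual code.

```c
#include <assert.h>

/* Hypothetical helper: pick a GPU for a CPU process by cycling through
 * the available devices (rank 0 -> device 0, rank 1 -> device 1, ...). */
static int device_for_rank(int rank, int num_devices) {
    assert(rank >= 0 && num_devices > 0);
    return rank % num_devices;
}

/* Fill out[0..num_ranks-1] with the device index chosen for each rank. */
static void assign_devices(int num_ranks, int num_devices, int *out) {
    for (int r = 0; r < num_ranks; ++r) {
        out[r] = device_for_rank(r, num_devices);
    }
}
```

With 4 CPU processes and 2 devices, ranks 0 and 2 share device 0 and ranks 1 and 3 share device 1. In a CUDA program the chosen index would typically be passed to `cudaSetDevice`.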

19 Section 3: Design Decisions and Methods

20 Guiding Principles
1. Performance: ultimately, this is our primary concern
  Intercept high in the call tree
  Write/use good kernels
2. Programmability: programmer time is a limited quantity
  Be maximally compartmental, minimally intrusive
  Don't get too clever
3. Portability: why write something that can't be used?
  Use standard languages (FORTRAN, C[, Python])
  Use standard libraries (CUBLAS, CUFFT)
  Don't add system assumptions

21 Development cycle (diagram labels): CPU, Profile, Optimize, Translate, Profile, Validate, Optimize, GPU, Debug

22 Incremental Ports
Our technique has been to climb up callgraphs.
Pros:
  Important work is done first
  Debugging is [more] palatable
  Provides rough numerical validation
Cons:
  Divergent efforts can require merges
  Inherit high-level structure from CPU code
Perturbation method.

28 Intercepts
#ifdef CUDA
    /* Assumptions */
    USE_CUDA = ( condition1 && condition2 && ... );
    if ( USE_CUDA ) {
        fun_cu(foo, bar);   // intercept (not a kernel)
    } else {
#endif
        /* Function to be intercepted */
        fun(foo, bar);
#ifdef CUDA
    }
#endif
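The intercept pattern above can be fleshed out into a small compilable sketch. Everything concrete here is an assumption for illustration: `fun` is a stand-in CPU routine, `fun_cu` is a hypothetical GPU wrapper that would be defined in a separate CUDA source file, and the size threshold `MIN_GPU_ELEMS` stands in for the slide's unspecified `condition1 && condition2 && ...`.

```c
#include <assert.h>
#include <stddef.h>

/* Original CPU routine: scales n values in place (stand-in for real work). */
static void fun(double *x, size_t n, double a) {
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) x[i] *= a;
}

#ifdef CUDA
/* Hypothetical GPU wrapper, defined elsewhere; not itself a kernel. */
void fun_cu(double *x, size_t n, double a);
#endif

/* Hypothetical cutoff below which the GPU path is not worth taking. */
enum { MIN_GPU_ELEMS = 1024 };

/* Call site with the intercept folded in.
 * Returns 1 if the GPU path was taken, 0 if the CPU routine ran. */
static int fun_intercepted(double *x, size_t n, double a) {
#ifdef CUDA
    /* Assumptions that must hold for the GPU version to be valid. */
    int use_cuda = (x != NULL && n >= MIN_GPU_ELEMS);
    if (use_cuda) {
        fun_cu(x, n, a);
        return 1;
    }
#endif
    fun(x, n, a);   /* function being intercepted */
    return 0;
}
```

Built without `-DCUDA`, the call site reduces to the original CPU call, so the intercept can be compiled out without touching the surrounding code.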

30 Validation
./vasp_test.py -e ../exes/vasp-pgk -t PdO-v/ -n 1
======================================================
Test Name: PdO-v/
Run on:
In: ./tests/3F0T
Result  Parameter      Test vs Expected
passed  energy         e+02 vs e+02
passed  ext. pressure  e+02 vs e+02
passed  volume         e+03 vs e+03
passed  stress (xx)    e+02 vs e+02
passed  stress (yy)    e+02 vs e+02
passed  stress (zz)    e+02 vs e+02
passed  stress (xy)    e+00 vs e+00
passed  stress (yz)    e+00 vs e+00
passed  stress (zx)    e+00 vs e
x       loop time      vs
0.95x   setdij time    vs
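The pass/fail column above compares each computed quantity against an expected value. The transcript does not show how `vasp_test.py` decides a match, so the following is only a sketch of one common rule, a combined relative and absolute tolerance; the function name `within_tolerance` and both tolerance parameters are assumptions.

```c
#include <assert.h>

/* Absolute value without pulling in the math library. */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Returns 1 if |test - expected| is no larger than the allowed difference,
 * where the allowed difference is the larger of rel_tol * |expected|
 * and abs_tol; returns 0 otherwise. */
static int within_tolerance(double test, double expected,
                            double rel_tol, double abs_tol) {
    assert(rel_tol >= 0.0 && abs_tol >= 0.0);
    double diff = abs_d(test - expected);
    double limit = rel_tol * abs_d(expected);
    if (limit < abs_tol) limit = abs_tol;
    return diff <= limit;
}
```

The absolute floor matters for quantities such as off-diagonal stresses, whose expected values can be close to zero, where a purely relative test would reject harmless round-off differences.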

32 CUDA Profiler
Columns: Tot, Num, Avg, % (values not preserved in this transcript)
Rows, by "method=", in order: A_kernel, gemm, double_, crrexp_mul_wave_k, aug_charge_trace_k, mul_vec_k, charge_trace_k, racc0_combine_k, calc_dllmm_k, apply_gfac_der_k, apply_gfac_k, eccp_nl_fock, memcpy, rpro1_combine_k, split_complex_k, else

33 CUDA Profiler
Columns: Tot, Num, Avg, % (values not preserved in this transcript)
Rows, by "method=", in order: memcpy, A_kernel, B_kernel, memset32, else, gemm, crrexp_mul_wave_k, racc0_combine_k, charge_trace_k, aug_charge_trace_k, apply_gfac_der_k, apply_gfac_k, eccp_nl_fock, double_, mul_vec_k, rpro1_combine_k, split_complex_k

35 Persistent pointers
/* void pointer */
typedef struct void_p {
    unsigned int size;
    void *ptr;
} void_p;

/* double pointer */
typedef struct double_p {
    unsigned int size;
    double *ptr;
} double_p;

36 Persistent pointers
/* Assign a chunk of GPU mem to a chunk of CPU mem */
static inline void assign_cu(
    void_p *dest,        //!< destination
    void *src,           //!< source
    unsigned int size    //!< size
){
    /* Do we need to resize? */
    if (dest->ptr == NULL || dest->size < size) {
        if (dest->ptr != NULL) cudaFree(dest->ptr);
        cudaMalloc((void **)&dest->ptr, size);
        dest->size = size;
    }
    /* Do the actual copy */
    cudaMemcpy(dest->ptr, src, size, cudaMemcpyHostToDevice);
}
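The grow-only reuse logic in `assign_cu` can be tried without a GPU. The sketch below is a host-only analogue, not the port's code: `malloc`, `free` and `memcpy` stand in for `cudaMalloc`, `cudaFree` and `cudaMemcpy`, and the names `host_buf` and `assign_host` and the return codes are assumptions added for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Host-side analogue of void_p: a buffer that remembers its capacity. */
typedef struct host_buf {
    size_t size;   /* capacity in bytes */
    void  *ptr;    /* NULL until first use */
} host_buf;

/* Copy `size` bytes from src into dest, reallocating only when dest is
 * missing or too small. Returns 1 if a new allocation was made, 0 if the
 * existing one was reused, and -1 if allocation failed. */
static int assign_host(host_buf *dest, const void *src, size_t size) {
    int grew = 0;
    assert(dest != NULL && src != NULL);
    /* Do we need to resize? */
    if (dest->ptr == NULL || dest->size < size) {
        free(dest->ptr);                 /* free(NULL) is a no-op */
        dest->ptr = malloc(size);
        if (dest->ptr == NULL) {
            dest->size = 0;
            return -1;
        }
        dest->size = size;
        grew = 1;
    }
    /* Do the actual copy */
    memcpy(dest->ptr, src, size);
    return grew;
}
```

Because the buffer never shrinks, repeated calls with the same or smaller sizes skip allocation entirely, which is the point of keeping the pointer persistent across iterations.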

37 Structs
typedef struct fourvector {
    int t; int x; int y; int z;
} fourvector;
fourvector events[N];
Improves locality for elemental functions. Mechanism is deep memory caches.

typedef struct fourvectors {
    int t[N]; int x[N]; int y[N]; int z[N];
} fourvectors;
fourvectors events;
Improves memory bandwidth for vector functions. Mechanism is wide memory bus.
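The two layouts can be compared side by side in a short sketch. The type and function names below (`event_aos`, `events_soa`, `event_sum_aos`, `total_t_soa`, `total_t_aos`) and the value of `N` are assumptions chosen for illustration; the access patterns are what the slide's "elemental" and "vector" functions refer to.

```c
#include <assert.h>

#define N 4  /* hypothetical event count, kept small for illustration */

/* Array of structures: the four fields of one event are adjacent in memory. */
typedef struct { int t, x, y, z; } event_aos;

/* Structure of arrays: the same field of consecutive events is adjacent. */
typedef struct { int t[N], x[N], y[N], z[N]; } events_soa;

/* "Elemental" access: touches every field of a single event. */
static int event_sum_aos(const event_aos *e) {
    return e->t + e->x + e->y + e->z;
}

/* "Vector" access: sweeps one field across all events (unit stride in SoA). */
static int total_t_soa(const events_soa *ev) {
    int total = 0;
    for (int i = 0; i < N; ++i) total += ev->t[i];
    return total;
}

/* The same sweep over the AoS layout steps past x, y and z on every element. */
static int total_t_aos(const event_aos ev[N]) {
    int total = 0;
    for (int i = 0; i < N; ++i) total += ev[i].t;
    return total;
}
```

Both sweeps return the same total; the difference is that the structure-of-arrays version reads consecutive addresses, which is the access pattern that lets neighbouring GPU threads share wide memory transactions.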

38 Intercepts vs Overhauls
Intercepts and overhauls have the same theoretical peak performance.
  Maximal intercept is 2 codes
One is usually easier than the other.
Difficulty of intercepts is governed by
  Loop position: must intercept above fine-grain loops
  Data structures: must pass data and context to GPU
Difficulty of overhauls is governed by
  Size, complexity of auxiliary code
  State of the original code
Overhaul has side-benefits. Intercepts have side-costs.

39 Section 4: Road-Map

40 Non-HF Port
Port will use the same scheme as HF port
  Climbing up many of the non-HF versions of CPU routines
  Trying to get all the way up to minimization routine (e.g. RMM-DIIS)
You can expect performance approaching HF performance
  Less parallelism for systems of the same size
  More rapid iteration
  Mitigated by larger quantum systems
Our goal is beta by sometime this summer

41 Merge with VASP Core
Our code is generally available to VASP license holders
  Must request access through Vienna
  Distribution through our website and git repo
This scheme is inadequate (doesn't scale).
We hope to put the ports in VASP 5.3, which will have some other architectural changes.

42 Wish List
Users, to do science
  It's all about science
  Find the kinks in our implementation
Input, to direct effort and validate results
  Scientifically relevant systems
  Requests for functionality
Effort, to write the ports
  Current VASP users with time to contribute
  VASP is a large code

43 Conclusions
We've ported HF functionality in VASP to CUDA.
  Up to 20x performance over single core
  Up to 64 core performance compared to supercomputers
Callgraph climbing port method is effective
  Accelerate specific functionality of large codes
  Can inform future decisions about dedicated ports
Accelerating scientific codes enables new science.

44 Thank you
Questions?
