HPC Application Porting to CUDA at BSC

1 HPC Application Porting to CUDA at BSC
Pau Farré, Marc Jordà
GTC San Jose

2 Agenda
- WARIS-Transport
  - Atmospheric volcanic ash transport simulation
  - Computer Applications department
- PELE
  - Protein-drug interaction simulation
  - Life Sciences department

3 WARIS-Transport
Volcanic ash dispersion simulation

4 Motivation
- Forecast of atmospheric transport and deposition of volcanic ash
- Meteorological models
- VAAC: Volcanic Ash Advisory Centers
  - Monitor volcanic eruptions
  - Help airlines
  - Redirect flights

5 Eruptions
- Eyjafjallajökull eruption (Iceland, 2010)
  - 48% of flights in Europe cancelled during one week ( flights)
  - Over 1.3 billion in losses
- Puyehue-Cordón Caulle eruption (Chile, 2011)
  - Multiple flights cancelled in Chile, Argentina, South Africa, and Australia
(Figures: ash extension maps; airspace shutdown)

6 Description
- Rectangular Cartesian grid (x, y, z)
- Factors controlling atmospheric transport:
  - Wind advection
  - Turbulent diffusion
  - Gravitational settling of particles
- General advection-diffusion-reaction equation
- Custom Jacobi stencil
(Figure labels: stencil, output)

7 Algorithm
- Finite difference method: iterative process
- Main computation: advection-diffusion-reaction
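The iterative finite-difference scheme above can be illustrated with a minimal CPU-side sketch in C++. This is not the WARIS-Transport stencil: the `Grid` type, the `adr_step` function, the constant wind `u` along x, the constant diffusivity `K`, the unit grid spacing, and the omission of the reaction and settling terms are all simplifying assumptions made for illustration. The point it shows is the Jacobi property: each output cell depends only on the previous time step, so every cell can be updated independently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical grid container: nx*ny*nz cells stored contiguously, x fastest.
struct Grid {
    std::size_t nx, ny, nz;
    std::vector<double> c;  // concentration per cell
    Grid(std::size_t nx_, std::size_t ny_, std::size_t nz_, double init)
        : nx(nx_), ny(ny_), nz(nz_), c(nx_ * ny_ * nz_, init) {}
    std::size_t idx(std::size_t i, std::size_t j, std::size_t k) const {
        return (k * ny + j) * nx + i;
    }
};

// One Jacobi-style explicit time step. Every interior cell of `out` is
// computed only from `in`, so the loop iterations are independent (in a
// CUDA port, one thread per cell would be a natural mapping).
// Assumptions: wind u >= 0 along x only (first-order upwind difference),
// constant diffusivity K, unit grid spacing, no reaction/settling term,
// boundary cells copied unchanged.
void adr_step(const Grid& in, Grid& out, double u, double K, double dt) {
    assert(in.nx == out.nx && in.ny == out.ny && in.nz == out.nz);
    assert(u >= 0.0);
    out.c = in.c;  // keeps the boundary values
    for (std::size_t k = 1; k + 1 < in.nz; ++k) {
        for (std::size_t j = 1; j + 1 < in.ny; ++j) {
            for (std::size_t i = 1; i + 1 < in.nx; ++i) {
                const double center = in.c[in.idx(i, j, k)];
                // 7-point Laplacian for the diffusion term
                const double lap =
                    in.c[in.idx(i + 1, j, k)] + in.c[in.idx(i - 1, j, k)] +
                    in.c[in.idx(i, j + 1, k)] + in.c[in.idx(i, j - 1, k)] +
                    in.c[in.idx(i, j, k + 1)] + in.c[in.idx(i, j, k - 1)] -
                    6.0 * center;
                // Upwind difference for the advection term
                const double adv = u * (center - in.c[in.idx(i - 1, j, k)]);
                out.c[out.idx(i, j, k)] = center + dt * (K * lap - adv);
            }
        }
    }
}
```

A time loop would call `adr_step` repeatedly and swap the two grids between iterations, which is what makes keeping both buffers resident on the GPU attractive.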

8 CUDA Implementation (I)
1. Advection-Diffusion-Reaction kernel
   - ~80% of CPU execution time

9 CUDA Implementation (II)
1. Advection-Diffusion-Reaction kernel
2. Compute terminal velocity
   - Meteorological computations

10 CUDA Implementation (III)
1. Advection-Diffusion-Reaction kernel
2. Compute terminal velocity
3. Implement all non-I/O computations on the GPU
   - Minimize CPU-GPU copies

11 CUDA Implementation (IV)
1. Advection-Diffusion-Reaction kernel
2. Compute terminal velocity
3. Implement all non-I/O computations on the GPU
4. Different particle sizes are launched in different streams

12 Kernel Overlap
- Chile-2011 dataset, 0.25º (grid size 121x121x64)
- Chile-2011 dataset, 0.05º (grid size 601x601x64)
- Some datasets are too small to fully occupy all SMs with only one kernel
- Parallel kernel execution to fully occupy all SMs

13 Results
- Implementations:
  - MPI + AVX
  - MPI + OpenMP + AVX
  - MIC (MPI + OpenMP + AVX)
  - MPI + CUDA (1 GPU/rank)
- Chile-2011 dataset, 0.05º
- MareNostrum supercomputer: 16 cores/node, 2x Intel MIC
- GPU server: 4x Nvidia Tesla K40
- 4 GPUs run as fast as 8 MareNostrum 3 nodes (128 cores)

14 PELE: Protein Energy Landscape Exploration
Interactive Drug Design with Monte Carlo Simulations

15 PELE Vision
- Drug design is a costly process
- Design through interactive biomolecular simulations
- Statistical approach
- Faster simulations
- Visual analysis
- Computational power + human intuition
- PELE-GUI

16 PELE: Protein Energy Landscape Exploration
Monte Carlo approach where each trial does:
- Perturbation: protein shape + ligand position
- Relaxation: further refinement to a more stable position (energy minimization)
- Acceptance test: if accepted, the result is used as the initial conformation for future trials
(Figure labels: Relaxation, Perturbation)

17 PELE Demo

18 PELE Energy Formula
- Initial profiling: energy computation was the most time-consuming task
- Execution time cost of energy terms:
  - Bond energy: 1.27%
  - Angle energy: 0.93%
  - Dihedral energy: 2.13%
  - Non-bonding interactions, total: 37.58%
    - Electrostatic
    - Lennard-Jones
    - Solvent energy
  - Update alphas: 27.96%

20 CUDA Implementation
- Update alphas (27.96%)
  - All-to-all atom interactions
  - No major issues
- Non-bonding terms (37.58%)
  - List of interactions (atom pairs)
  - Several cut-offs to reduce the number of interactions
- CUDA implementation
  - New data structure for the interactions list on the GPU
  - With atomics
    - Profiling showed high overheads
    - Lack of double-precision (DP) atomics?
    - High contention due to list order?
  - Without atomics
    - Main kernel + custom reduction to aggregate results
    - ~3x faster than the first approach
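The "without atomics" approach can be sketched on the CPU in C++ as a two-stage computation: a main pass in which each block of pairs writes its own partial result, followed by a reduction over the partials. This is only an illustration of the pattern, not PELE's kernel. The `Atom` and `Pair` types, the 1-D Coulomb-like `pair_energy`, and the choice to reduce a single total energy are assumptions; the slides do not say exactly which quantities are aggregated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Atom { double x, q; };   // hypothetical: 1-D position and charge
struct Pair { int i, j; };      // one entry of the interactions list

// Hypothetical stand-in for a non-bonding pair term: q_i * q_j / r.
double pair_energy(const Atom& a, const Atom& b) {
    const double r = a.x > b.x ? a.x - b.x : b.x - a.x;
    assert(r > 0.0);
    return a.q * b.q / r;
}

// Stage 1 ("main kernel"): each block of `block_size` consecutive pairs
// accumulates into a private variable and writes exactly one slot of
// `partial`. No two blocks write the same slot, so no atomics are needed.
std::vector<double> block_partials(const std::vector<Atom>& atoms,
                                   const std::vector<Pair>& pairs,
                                   std::size_t block_size) {
    assert(block_size > 0);
    const std::size_t nblocks = (pairs.size() + block_size - 1) / block_size;
    std::vector<double> partial(nblocks, 0.0);
    for (std::size_t b = 0; b < nblocks; ++b) {  // blocks are independent
        const std::size_t begin = b * block_size;
        const std::size_t end = std::min(begin + block_size, pairs.size());
        double local = 0.0;
        for (std::size_t p = begin; p < end; ++p)
            local += pair_energy(atoms[pairs[p].i], atoms[pairs[p].j]);
        partial[b] = local;
    }
    return partial;
}

// Stage 2 ("reduction"): tree-shaped sum of the partial results, halving
// the number of active elements on each pass.
double reduce_sum(std::vector<double> v) {
    if (v.empty()) return 0.0;
    for (std::size_t n = v.size(); n > 1; n = (n + 1) / 2) {
        const std::size_t half = (n + 1) / 2;
        for (std::size_t i = 0; i + half < n; ++i) v[i] += v[i + half];
    }
    return v[0];
}
```

The trade-off is extra memory for the partial results and a second pass, in exchange for removing contended writes from the main computation.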

21 CUDA Implementation (II)
- Energy computations are performed multiple times in different parts of PELE
- Data must be kept coherent between CPU and GPU
- High code complexity
- Porting everything in between involves a major refactoring
(Figure labels: energy computations in time; PELE call graph)

22 CPU/GPU data coherence
- Explicit CPU-GPU copies
  - Code is harder to follow and maintain
  - Complex application: difficult to track which CPU code uses GPU results
    - Usage may depend on many conditions
  - Programmers tend to be conservative
    - Always copy GPU results to host after the kernel
    - If not used, performance cost for no reason
- Automatic CPU-GPU copies
  - CUDA Unified Virtual Memory (UVM)
  - Unified CPU & GPU data structures
    - Allocation pointers can be used on both the CPU and the GPU
    - CUDA runtime manages the copies internally
  - Custom std::allocator for std::vectors
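A custom allocator of the kind mentioned in the last bullet might look roughly like the following C++ sketch. The names `ManagedAllocator`, `managed_vector`, `managed_alloc`, and `managed_free` are hypothetical and not taken from PELE. To keep the sketch compilable without the CUDA toolkit, the two backend hooks fall back to `std::malloc`/`std::free`; in a CUDA build they would presumably wrap `cudaMallocManaged` and `cudaFree`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical backend hooks. In a CUDA build these would call
// cudaMallocManaged(&p, bytes) and cudaFree(p); plain malloc/free is used
// here so the sketch compiles on a machine without CUDA.
inline void* managed_alloc(std::size_t bytes) { return std::malloc(bytes); }
inline void managed_free(void* p) { std::free(p); }

// Minimal allocator satisfying the C++11 Allocator requirements.
template <class T>
struct ManagedAllocator {
    using value_type = T;

    ManagedAllocator() = default;
    template <class U>
    ManagedAllocator(const ManagedAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        assert(n <= static_cast<std::size_t>(-1) / sizeof(T));  // no overflow
        void* p = managed_alloc(n * sizeof(T));
        if (!p && n != 0) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept { managed_free(p); }
};

// The allocator is stateless, so all instances compare equal.
template <class T, class U>
bool operator==(const ManagedAllocator<T>&, const ManagedAllocator<U>&) noexcept {
    return true;
}
template <class T, class U>
bool operator!=(const ManagedAllocator<T>&, const ManagedAllocator<U>&) noexcept {
    return false;
}

// A std::vector whose storage comes from the managed backend.
template <class T>
using managed_vector = std::vector<T, ManagedAllocator<T>>;
```

With such an alias, existing code that uses `std::vector<double>` can switch to `managed_vector<double>` and, in a managed-memory build, pass `v.data()` directly to a kernel.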

23 UVM profiling
- 4 KB copies are not large enough to reach maximum PCIe bandwidth
- Also, some unnecessary copies
  - The runtime has to be conservative because it doesn't always know what's input or output
  - Our use of streams and allocations attached to them was not optimal

24 Semi-automatic memory manager
- UVM style
- It maintains pairs of allocations (CPU & GPU)
- DtoH copies are only performed when the data is really needed on the CPU
  - A page-fault handler detects CPU accesses
  - Copies the whole allocation at once
    - Better bandwidth
- Before launching a kernel
  - Call owner_gpu(void* host_ptr, access_type)
    - Access types: Read, Write, ReadWrite, FullWrite
    - Returns gpu_ptr
- After the kernel launch
  - Call owner_cpu(...) to notify the memory manager
  - As said, copies are done lazily when needed
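The ownership protocol described on this slide can be sketched as a small state machine in C++. Only the names `owner_gpu`, `owner_cpu`, and the four access types come from the slide; everything else is an assumption made for illustration. In particular, the "device" buffer is ordinary host memory, `host_access` is a hypothetical explicit call standing in for the page-fault handler, and the rule that only `FullWrite` skips the host-to-device upload is a guess at the intended semantics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <map>
#include <vector>

enum class Access { Read, Write, ReadWrite, FullWrite };

class MemoryManager {
    enum class Fresh { Host, Device, Both };  // where the valid copy lives
    struct Entry {
        std::vector<unsigned char> dev;       // simulated GPU allocation
        Fresh fresh = Fresh::Host;
    };
    std::map<void*, Entry> table_;            // host pointer -> paired entry

public:
    std::size_t h2d_copies = 0, d2h_copies = 0;

    // Create the GPU half of a (CPU, GPU) allocation pair.
    void register_alloc(void* host_ptr, std::size_t bytes) {
        Entry& e = table_[host_ptr];
        e.dev.assign(bytes, 0);
        e.fresh = Fresh::Host;
    }

    // Call before a kernel launch; returns the pointer the kernel should use.
    // Uploads the whole allocation only if the kernel may observe old
    // contents (anything except FullWrite) and the host copy is newer.
    void* owner_gpu(void* host_ptr, Access a) {
        Entry& e = table_.at(host_ptr);
        if (a != Access::FullWrite && e.fresh == Fresh::Host) {
            std::memcpy(e.dev.data(), host_ptr, e.dev.size());  // one HtoD copy
            ++h2d_copies;
            e.fresh = Fresh::Both;
        }
        if (a != Access::Read) e.fresh = Fresh::Device;  // kernel will modify
        return e.dev.data();
    }

    // Call after the kernel launch. Nothing is copied here: a real
    // implementation would arm the fault detection at this point.
    void owner_cpu(void* host_ptr) { (void)table_.at(host_ptr); }

    // Stand-in for the page-fault handler: the first CPU access after a
    // kernel pulls the whole allocation back in one copy. Afterwards the
    // host copy is conservatively treated as the newest one.
    void host_access(void* host_ptr) {
        Entry& e = table_.at(host_ptr);
        if (e.fresh == Fresh::Device) {
            std::memcpy(host_ptr, e.dev.data(), e.dev.size());  // one DtoH copy
            ++d2h_copies;
        }
        e.fresh = Fresh::Host;
    }
};
```

The counters make the lazy behaviour visible: results that the CPU never touches never trigger a device-to-host copy, and a `FullWrite` kernel never triggers an upload.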

25 Performance comparison
- UVM vs. semi-automatic memory manager
- The semi-automatic memory manager has better performance
  - Mainly because of better PCIe bandwidth

26 Results (I)
(Speedup labels from the chart: 55x, 5.29x, 15.09x)

27 Results (II)
- Speedups: 2.4x, 2x
- Upper bound: 2.9x (Amdahl's law)
- PELE acceleration is still ongoing
  - Non-bonding list generation
  - Computations in perturbation step
  - Etc.
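The 2.9x upper bound is consistent with the profiling figures given earlier, assuming the accelerated fraction p is the sum of the two ported terms (non-bonding terms, 37.58%, and update alphas, 27.96%) and that their execution time on the GPU is taken as zero:

```latex
p = 0.3758 + 0.2796 = 0.6554
\qquad
S_{\max} = \frac{1}{1 - p} = \frac{1}{0.3446} \approx 2.90
```

For a finite speedup s of the ported part, the general form of Amdahl's law gives S = 1 / ((1 - p) + p / s), which is always below this bound.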

28 Conclusions

29 Conclusions
- Acceleration of existing applications
  - Some parts are accelerated while others are kept on the CPU
  - Maintaining data coherence between CPU & GPU is complex
- We showed two examples:
  - WARIS-Transport: simple enough to port most of the computations to the GPU and keep the data there
  - PELE: complex app, use a manager to handle the copies
    - UVM is a great tool to automate the copies
    - We implemented a semi-automatic memory manager to improve the performance
- Atomics might have a large performance impact
  - Store partial results and apply a reduction step after the kernel
  - Libraries can help with reductions: CUB, Modern GPU, etc.

30 Thank you! For further information please contact
