Overlapping Computation and Communication for Advection on Hybrid Parallel Computers

1 Overlapping Computation and Communication for Advection on Hybrid Parallel Computers
James B White III (Trey), trey@ucar.edu, National Center for Atmospheric Research
Jack Dongarra, dongarra@eecs.utk.edu, University of Tennessee, Knoxville
Programming Weather, Climate, and Earth-System Models on Heterogeneous Multi-Core Platforms, NCAR, September 8, 2011
Based on work first presented at IPDPS, Anchorage, AK, May 17, 2011
Portions of this work were funded by the Office of Biological and Environmental Research and the Office of Advanced Scientific Computing Research, both of the US Department of Energy. This research used resources of the OLCF at Oak Ridge National Laboratory and of NERSC at Lawrence Berkeley National Laboratory, both of which are supported by the Office of Science of the US Department of Energy.

2 Test Case
- Linear advection with constant uniform velocity
- Three-dimensional cube with periodic boundaries
- Advect Gaussian wave through cube corner back to original position
- Strong scaling, 42x42x42
- Explicit 2nd-order single-stage integration, 3x3x3 centered stencil, 64-bit precision
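The slide does not give the stencil coefficients. One scheme that fits the description (explicit, single-stage, second-order, 3x3x3 centered stencil) is a tensor product of three 1D Lax-Wendroff stencils; the sketch below assumes that scheme, which may differ from what the authors used. The function names and the nested-list grid are illustrative only.

```python
def lax_wendroff_weights(c):
    """1D Lax-Wendroff weights for offsets -1, 0, +1 at Courant number c."""
    return (0.5 * c * (1.0 + c), 1.0 - c * c, -0.5 * c * (1.0 - c))


def stencil_3x3x3(cx, cy, cz):
    """27 weights formed as the tensor product of three 1D stencils (assumed scheme)."""
    wx, wy, wz = (lax_wendroff_weights(c) for c in (cx, cy, cz))
    return [[[wx[i] * wy[j] * wz[k] for k in range(3)]
             for j in range(3)] for i in range(3)]


def step(u, w):
    """One explicit time step on a periodic n x n x n grid stored as nested lists."""
    n = len(u)
    new = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for x in range(n):
        for y in range(n):
            for z in range(n):
                total = 0.0
                for i in range(3):
                    for j in range(3):
                        for k in range(3):
                            # Modulo indexing implements the periodic boundaries.
                            total += w[i][j][k] * u[(x + i - 1) % n][(y + j - 1) % n][(z + k - 1) % n]
                new[x][y][z] = total
    return new
```

At a Courant number of 1 in each direction the stencil collapses to a pure shift by one cell along the cube diagonal, so n steps on an n-point periodic grid return the field to its starting position, which mirrors the "through cube corner back to original position" setup.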

3 Computers
System            JaguarPF         Hopper II    Lens            Yona
Interconnect      Cray SeaStar 2+  Cray Gemini  DDR Infiniband  QDR Infiniband
MPI               Cray MPT 4..     Cray MPT     OpenMPI         OpenMPI 1.7a1
NVIDIA Tesla GPU  (none)           (none)       C1060           C2050
GPU memory (GB)   (none)           (none)       4               3
Also listed per system, values not preserved: compute nodes, memory per node (GB), AMD Opteron sockets per node, cores per Opteron socket, Opteron clock (GHz).

6 Implementations
- Single task (Fortran + OpenMP)
- Bulk-synchronous MPI
- MPI using nonblocking communication for overlap
- MPI using OpenMP threading for overlap
- GPU resident (CUDA Fortran)
- GPU with bulk-synchronous MPI
- GPU with MPI overlap using CUDA streams
- CPU and GPU computation with bulk-synchronous MPI
- CPU and GPU computation partitioned for overlap with nonblocking MPI and CPU-GPU communication
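The ordering behind the overlap variants can be sketched without MPI: start the halo exchange, update the points that need no halo data, wait for the exchange, then update the edge points. The example below is only an analogy in 1D. A background thread stands in for nonblocking communication, `fetch_halo` is a hypothetical callback, and the Lax-Wendroff weights are an assumed scheme rather than the authors' code.

```python
from concurrent.futures import ThreadPoolExecutor


def weights(c):
    """1D second-order (Lax-Wendroff) weights for offsets -1, 0, +1 (assumed scheme)."""
    return (0.5 * c * (1.0 + c), 1.0 - c * c, -0.5 * c * (1.0 - c))


def step_reference(u, c):
    """Single-domain periodic step, used to check the decomposed version."""
    wl, wc, wr = weights(c)
    n = len(u)
    return [wl * u[i - 1] + wc * u[i] + wr * u[(i + 1) % n] for i in range(n)]


def step_overlapped(local, fetch_halo, c):
    """Update one subdomain (at least 2 points) while its halo values are in flight.

    fetch_halo() returns (left_halo, right_halo); here it plays the role of
    posting nonblocking receives and later waiting on them.
    """
    wl, wc, wr = weights(c)
    n = len(local)
    new = [0.0] * n
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_halo)        # start "communication"
        for i in range(1, n - 1):                # interior: no halo needed
            new[i] = wl * local[i - 1] + wc * local[i] + wr * local[i + 1]
        left, right = pending.result()           # wait for "communication"
    # Edge points are updated last because they depend on the halo values.
    new[0] = wl * left + wc * local[0] + wr * local[1]
    new[n - 1] = wl * local[n - 2] + wc * local[n - 1] + wr * right
    return new


def demo():
    """Split a periodic 8-point field into two subdomains and step both."""
    u = [float(i * i % 7) for i in range(8)]
    a, b = u[:4], u[4:]
    new_a = step_overlapped(a, lambda: (b[-1], b[0]), 0.5)
    new_b = step_overlapped(b, lambda: (a[-1], a[0]), 0.5)
    return new_a + new_b, step_reference(u, 0.5)
```

`demo()` returns the decomposed result alongside the single-domain result so the two can be compared. A bulk-synchronous version would instead finish the exchange before touching any point.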

7 CPU-GPU Domain Decomposition
- Global domain decomposed into MPI-task domains
- Task domain partitioned into CPU and GPU domains
- Diagram labels: CPU(s); halo for MPI communication; GPU; halos for CPU-GPU communication
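To get a feel for how the CPU share of the work changes with the partition, the sketch below counts grid points under one assumed layout: the CPU owns an outer shell of the task domain, a fixed number of points thick (the "box width" named on the later overlap slides), and the GPU owns the interior block. The slides do not spell out this geometry, so the layout, the function name, and the cubic task domain are all assumptions.

```python
def partition_task_domain(n, box_width):
    """Count CPU and GPU points for an n x n x n task domain.

    Assumed layout: the CPU owns a shell `box_width` points thick on every
    face of the cube; the GPU owns the remaining interior cube.
    Returns (cpu_points, gpu_points).
    """
    if n <= 0 or box_width < 0 or 2 * box_width > n:
        raise ValueError("box_width must satisfy 0 <= 2 * box_width <= n")
    inner = n - 2 * box_width          # edge length of the GPU block
    gpu_points = inner ** 3
    cpu_points = n ** 3 - gpu_points   # everything outside the GPU block
    return cpu_points, gpu_points
```

For a hypothetical 100-point task domain, widths of 2, 4, and 6 give the CPU roughly 11.5%, 22.1%, and 31.9% of the points, so even a thin shell is a noticeable share of the work.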

8 Lines of Code
[Chart: lines of code for each implementation: Single (with OpenMP), Bulk Synchronous, Nonblocking Overlap, OpenMP Overlap, GPU Resident, GPU Bulk Synchronous, GPU Overlap, CPU GPU Bulk Synchronous, CPU GPU Overlap. Callouts: "Similar", "MPI adds 5-75%", "4 times the code".]

12 Best JaguarPF Performance
[Plot: performance versus cores for Bulk Synchronous, Nonblocking Overlap, and OpenMP Overlap. Callouts: "nonblocking", "bulk synchronous".]

14 Best Hopper-II Performance
[Plot: performance versus cores for Bulk Synchronous, Nonblocking Overlap, and OpenMP Overlap. Callouts: "nonblocking", "bulk synchronous", "JaguarPF plot".]

16 Bulk-Synchronous Performance on JaguarPF
[Plot: performance by Threads/Task and Cores. Callouts: "each ratio best somewhere", "best ratio grows with core count".]

18 Bulk-Synchronous Performance on Hopper II
[Plot: performance by Threads/Task and Cores. Callouts: "best ratio grows with core count", "24 never best".]

20 GPU-Resident Performance on Lens
[Plot: performance by X Block and Y Block. Callouts: "36.3 GF", "x11 used for remaining experiments".]

22 GPU-Resident Performance on Yona
[Plot: performance by X Block and Y Block. Callouts: "86.2 GF", "x8 used for remaining experiments".]

24 Best Performance on Lens
[Plot: performance versus cores (1 GPU per 16 cores) for CPU GPU Overlap, CPU GPU Bulk Sync, Bulk Sync, Nonblocking Overlap, OpenMP Overlap, GPU Overlap, and GPU Bulk Sync. Callouts: "almost 2x", "GPU resident".]

26 Best Performance on Yona
[Plot: performance versus cores (1 GPU per 12 cores) for CPU GPU Overlap, CPU GPU Bulk Sync, GPU Overlap, GPU Bulk Sync, Bulk Sync, Nonblocking Overlap, and OpenMP Overlap. Callouts: "over 2.6x", "GPU resident".]

28 CPU-GPU Overlap Performance on Lens
[Plot: performance versus cores (1 GPU per 16 cores). Legend (Threads/Task, Box Width): 16, 2; 16, 4; 16, 6; 8, 4; 8, [missing]. Callouts: "threads/task goes up", "box width goes down".]

30 CPU-GPU Overlap Performance on Yona
[Plot: performance versus cores (1 GPU per 12 cores). Legend (Threads/Task, Box Width): 12, 1; 6, 1; 6, 3. Callouts: "threads/task goes up", "box width starts small, goes smaller".]

32 Overlapping Computation and Communication for Advection on Hybrid Parallel Computers
- MPI overlap less important for this test
- But tuning threads/task is important
- Overlapping CPU computation, GPU computation, MPI communication, and CPU-GPU communication
  - Improves performance by more than 2x
  - Matches GPU-resident performance per GPU
- Best performance from giving minimal (but nonvanishing) work to CPU
- Performance comes at a 4x cost in lines of code
