Optimizing Multiple GPU FDTD Simulations in CUDA


Center of Applied Electromagnetic Systems Research (CAESR)
Matthew J. Inman, Atef Z. Elsherbeni
Center for Applied Electromagnetics Systems Research (CAESR), Electrical Engineering Department, University, MS

Computer Information
Quad Core 2.6 GHz Intel Core i7
6 GB of DDR RAM
2x NVIDIA Tesla T16 GPU

Optimizations
When dealing with single GPU simulations, optimizations are straightforward: block sizes, domain sizes, memory access patterns (global and shared), and memory use.
How do these optimizations extend to multiple GPU simulations?

Optimizations
1. Selecting the proper domain size
   1. Important when dividing your domain
2. Selecting proper methods to transfer data between GPUs
   1. Selecting proper components
   2. Minimizing impact on runtime

Computation Time vs Domain Size
[Figure: Average Time Step Computation Time (ms) vs Computational Domain Size, NX/NY/NZ (cells). Curves: NY=1, NZ=1; NX=1, NZ=1; NX=1, NY=1.]

No matter what we are trying to model, it eventually becomes a 1D array in memory!
[Figure: grid with x, y, z axes and its memory]

3D Grid Memory Layout
Memory translation, in storage order:
(X=1, Y=1, Z=1), (X=2, Y=1, Z=1), (X=1, Y=2, Z=1), (X=2, Y=2, Z=1), (X=1, Y=1, Z=2), (X=2, Y=1, Z=2), (X=1, Y=2, Z=2), (X=2, Y=2, Z=2)

3D Grid Memory Layout: Memory Access for Neighboring X
Only 1 memory location away

3D Grid Memory Layout: Memory Access for Neighboring Y
NX memory locations away

3D Grid Memory Layout: Memory Access for Neighboring Z
NX*NY memory locations away
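The three neighbor distances above follow from a single linear-index formula. The block below is a minimal sketch of that mapping, not the authors' code: the `Grid` struct, the function names, and the zero-based indexing are assumptions made for illustration (the slides number cells from 1).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: grid extents in cells.
struct Grid {
    std::size_t nx, ny, nz;
};

// Linear offset of cell (x, y, z), zero-based, with x varying fastest,
// then y, then z, matching the storage order shown on the slides.
inline std::size_t linearIndex(const Grid& g, std::size_t x, std::size_t y, std::size_t z) {
    assert(x < g.nx && y < g.ny && z < g.nz);
    return x + g.nx * (y + g.ny * z);
}

// Distance, in array elements, between a cell and its +1 neighbor along each axis.
inline std::size_t strideX(const Grid&)   { return 1; }
inline std::size_t strideY(const Grid& g) { return g.nx; }
inline std::size_t strideZ(const Grid& g) { return g.nx * g.ny; }
```

For a grid with NX=4, NY=3, NZ=2, the y neighbor of any cell sits 4 elements away and the z neighbor 12 elements away, so only x neighbors are adjacent in memory.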

Computation Time vs Domain Size
[Figure: Average Time Step Computation Time (ms) vs Computational Domain Size, NX/NY/NZ (cells). Curves: NY=1, NZ=1; NX=1, NZ=1; NX=1, NY=1.]

Computation Time vs Domain Size: NZ varies, NX and NY = 1
[Figure: Average Time Step Computation Time (ms) vs Computational Domain Size, NX/NY/NZ (cells). Curves: NY=1, NZ=1; NX=1, NZ=1; NX=1, NY=1.]
Neighboring X: ±1 memory location
Neighboring Y: ±NX memory locations
Neighboring Z: ±NX*NY memory locations

Computation Time vs Domain Size: NY varies, NX and NZ = 1
[Figure: same plot and neighbor distances as above.]

Computation Time vs Domain Size: NX varies, NY and NZ = 1
[Figure: same plot and neighbor distances as above.]


Transferring Ghost Cells
In decomposing any FDTD domain, each GPU will need some of an adjacent domain to complete its update. These ghost cells will need to be transferred each time step.
[Figure: GPU 1 | Ghost Cells | Ghost Cells | GPU 2]
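A minimal host-side sketch of one ghost-cell exchange follows. It is an illustration under stated assumptions, not the authors' implementation: the `Subdomain` struct and `exchangeGhostPlanes` are hypothetical names, the domain is assumed to be split along z so that each ghost layer is one contiguous NX*NY plane, and plain host arrays stand in for GPU memory. In a real multiple GPU code, each copy would be a device transfer rather than `std::copy`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical subdomain: nzLocal interior planes plus one ghost plane at each end in z.
struct Subdomain {
    std::size_t nx, ny, nzLocal;
    std::vector<float> field;  // nx * ny * (nzLocal + 2) values, x varying fastest

    Subdomain(std::size_t nx_, std::size_t ny_, std::size_t nz_)
        : nx(nx_), ny(ny_), nzLocal(nz_), field(nx_ * ny_ * (nz_ + 2), 0.0f) {}

    std::size_t planeSize() const { return nx * ny; }

    // Start of storage plane k (0 = lower ghost, nzLocal + 1 = upper ghost).
    float* plane(std::size_t k) {
        assert(k < nzLocal + 2);
        return field.data() + k * planeSize();
    }
};

// Exchange ghost planes between a lower subdomain and the one stacked above it in z.
// Called once per time step, after the interior update.
void exchangeGhostPlanes(Subdomain& lower, Subdomain& upper) {
    assert(lower.planeSize() == upper.planeSize());
    const std::size_t n = lower.planeSize();
    // Top interior plane of the lower subdomain -> bottom ghost plane of the upper one.
    std::copy(lower.plane(lower.nzLocal), lower.plane(lower.nzLocal) + n, upper.plane(0));
    // Bottom interior plane of the upper subdomain -> top ghost plane of the lower one.
    std::copy(upper.plane(1), upper.plane(1) + n, lower.plane(lower.nzLocal + 1));
}
```

Splitting along z keeps each ghost plane contiguous in this sketch. A split along x or y would instead require gathering strided values into a buffer before the transfer.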

Downloading Whole Field Component
[Figure: Average Data Transfer Time (ms) (Whole Field Component) vs Computational Domain Size, NX/NY/NZ (cells). Curves: NY=1, NZ=1; NX=1, NZ=1; NX=1, NY=1.]
Largest NX: smallest transfer time

Downloading Whole Field Component
[Figure: Data Transfer Time as Percent of Computational Time (Whole Field Component) vs Computational Domain Size, NX/NY/NZ (cells). Vertical axis: Average Data Transfer Time (%). Curves: NY=1, NZ=1; NX=1, NZ=1; NX=1, NY=1.]
Largest NX: smallest transfer time, but still 8%! And this is for only 1-sided problems!

Downloading Partial Field Component
[Figure: Average Data Transfer Time (ms) (Partial Field Component) vs Computational Domain Size, NX/NY/NZ (cells). Curves: NY=1, NZ=1; NX=1, NZ=1; NX=1, NY=1.]
Very minor differences in transfer time

Downloading Partial Field Component
[Figure: Data Transfer Time as Percent of Computational Time (Partial Field Component) vs Computational Domain Size, NX/NY/NZ (cells). Vertical axis: Average Data Transfer Time (%). Curves: NY=1, NZ=1; NX=1, NZ=1; NX=1, NY=1.]
Negligible difference

Domain Size on Data Transfer
Selecting the proper size and method to exchange data has large effects on runtime.
Downloading the entire field component is the easiest method but the most inefficient.
Downloading a partial field component is harder, but the kernel overhead is minor compared to the reduction in runtime.
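The gap between the two methods can be estimated from the amount of data moved. The sketch below is a back-of-the-envelope illustration only: the function names and the `SplitAxis` enum are hypothetical, single-precision (`float`) field values are assumed, and the partial transfer is taken to be one boundary plane normal to the split axis.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical: axis along which the domain is divided between GPUs.
enum class SplitAxis { X, Y, Z };

// Bytes moved when an entire field component is downloaded.
std::size_t wholeComponentBytes(std::size_t nx, std::size_t ny, std::size_t nz) {
    return nx * ny * nz * sizeof(float);
}

// Bytes moved when only one boundary plane (normal to the split axis) is downloaded.
std::size_t ghostPlaneBytes(std::size_t nx, std::size_t ny, std::size_t nz, SplitAxis axis) {
    switch (axis) {
        case SplitAxis::X: return ny * nz * sizeof(float);
        case SplitAxis::Y: return nx * nz * sizeof(float);
        case SplitAxis::Z: return nx * ny * sizeof(float);
    }
    return 0;  // unreachable for valid enum values
}
```

For a hypothetical 100 x 100 x 100 grid split along z, the whole component is 100 times larger than a single boundary plane. Planes normal to x or y are not contiguous in memory, so they would first need to be gathered into a buffer, which is consistent with the extra kernel overhead noted above.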

Questions?
Thank You
