GPU-optimized computational speed-up for the atmospheric chemistry box model from CAM4-Chem


1 GPU-optimized computational speed-up for the atmospheric chemistry box model from CAM4-Chem
Presenter: Jian Sun
Advisor: Joshua S. Fu
Collaborators: John B. Drake, Qingzhao Zhu, Azzam Haidar, Mark Gates, Stanimire Tomov, Jack Dongarra
Presented at the CESM Atmosphere Model, Chemistry Climate, and Whole Atmosphere Working Group meetings, February 12-14, 2018, NCAR's Mesa Lab, Boulder, Colorado

2 Background
Community Earth System Model (CESM)
Five components:
- Atmosphere (CAM)
- Land (CLM)
- Ocean (POP)
- Sea Ice (CSIM)
- Coupler
Supports various scenarios, resolutions, and machines

3 Schematic
Vertical grid (height or pressure)
Horizontal grid (latitude-longitude)

4 Chemistry Expression
In the current CAM4-Chem, the chemistry is represented by the following ODE:
$$\frac{dy}{dt} = F(y) = P(y) - L(y) + I(y)$$
where $y$ is the vector of species volume mixing ratios, $P(y)$ is the chemical production, $L(y)$ is the chemical loss, and $I(y)$ is the independent forcing (lightning, aircraft).

5 First-order implicit solver
To solve the system, the ODE is discretized in time with the first-order implicit Euler method and solved by the Newton-Raphson method:
$$G(y^k) = \frac{y^k - y^n}{\Delta t} - F(y^k)$$
$$J(y^k) = \frac{dG(y^k)}{dy}$$
$$J(y^k)\,\Delta y = -G(y^k)$$
$$y^{k+1} = y^k + \Delta y$$
where $J$ is the Jacobian matrix of the function $G$, $y^k$ is the solution after $k$ iterations, and $y^{k+1}$ is the solution after $k+1$ iterations. The iteration terminates when $\Delta y$ is less than a user-specified tolerance, and $y^{k+1}$ is then taken as the solution $y^{n+1}$ for the next time step.
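The Newton-Raphson iteration for one implicit Euler step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-species right-hand side `F`, its rate constants, and the helper names (`dF_dy`, `solve_2x2`, `implicit_euler_step`) are hypothetical and are not taken from CAM4-Chem.

```python
def F(y, k=1.0e3, loss=1.0e-1, source=1.0e-2):
    """Hypothetical toy right-hand side: production minus loss plus forcing."""
    y0, y1 = y
    return [source - k * y0 * y0, k * y0 * y0 - loss * y1]


def dF_dy(y, k=1.0e3, loss=1.0e-1):
    """Analytic Jacobian of the toy right-hand side F."""
    y0, _ = y
    return [[-2.0 * k * y0, 0.0], [2.0 * k * y0, -loss]]


def solve_2x2(a, b):
    """Solve the 2x2 linear system a x = b with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def implicit_euler_step(y_n, dt, tol=1.0e-12, max_iter=50):
    """Advance y_n by one implicit Euler step using Newton-Raphson."""
    y_k = list(y_n)  # initial guess: the previous time level
    for _ in range(max_iter):
        f = F(y_k)
        # G(y^k) = (y^k - y^n) / dt - F(y^k)
        g = [(y_k[i] - y_n[i]) / dt - f[i] for i in range(2)]
        a = dF_dy(y_k)
        # J = dG/dy = I / dt - dF/dy
        jac = [[(1.0 / dt if i == j else 0.0) - a[i][j] for j in range(2)]
               for i in range(2)]
        dy = solve_2x2(jac, [-g[0], -g[1]])  # J dy = -G
        y_k = [y_k[i] + dy[i] for i in range(2)]
        if max(abs(d) for d in dy) < tol:
            return y_k
    raise RuntimeError("Newton iteration did not converge")


if __name__ == "__main__":
    print(implicit_euler_step([1.0e-3, 0.0], 1800.0))
```

The 1800-second step matches the time step size quoted in the model configuration; everything else about the toy system is arbitrary.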

6 Second-order Rosenbrock solver
For the second-order Rosenbrock (ROS-2) solver:
$$(I - h\gamma A)\,k_1 = F(y^n)$$
$$(I - h\gamma A)\,k_2 = F(y^n + h k_1) - 2 k_1$$
$$y^{n+1} = y^n + \tfrac{3}{2} h k_1 + \tfrac{1}{2} h k_2$$
where $A$ is the Jacobian matrix of the right-hand side, i.e. $A = F'(y^n)$, and $k_1$, $k_2$ are intermediate solutions.
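A single ROS-2 step can be sketched as follows. The sketch assumes the commonly published form of the scheme, with the update y^{n+1} = y^n + (3/2) h k_1 + (1/2) h k_2 and gamma = 1 + 1/sqrt(2); the slide does not state the value of gamma, so that choice is an assumption. The linear decay chain and all function names are hypothetical and serve only to exercise the step.

```python
import math

# Assumed value: the gamma commonly used with ROS-2; not given on the slide.
GAMMA = 1.0 + 1.0 / math.sqrt(2.0)


def solve_2x2(a, b):
    """Solve the 2x2 linear system a x = b with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def ros2_step(f, jac, y_n, h, gamma=GAMMA):
    """One ROS-2 step for a two-species system y' = f(y)."""
    a = jac(y_n)
    # Both stages share the same matrix (I - h * gamma * A).
    m = [[(1.0 if i == j else 0.0) - h * gamma * a[i][j] for j in range(2)]
         for i in range(2)]
    k1 = solve_2x2(m, f(y_n))
    f_mid = f([y_n[i] + h * k1[i] for i in range(2)])
    k2 = solve_2x2(m, [f_mid[i] - 2.0 * k1[i] for i in range(2)])
    return [y_n[i] + 1.5 * h * k1[i] + 0.5 * h * k2[i] for i in range(2)]


def decay_chain(y, a=1.0, b=0.5):
    """Hypothetical linear test problem: species 0 decays into species 1."""
    return [-a * y[0], a * y[0] - b * y[1]]


def decay_chain_jac(y, a=1.0, b=0.5):
    """Jacobian of the decay chain (constant for a linear system)."""
    return [[-a, 0.0], [a, -b]]


if __name__ == "__main__":
    print(ros2_step(decay_chain, decay_chain_jac, [1.0, 0.0], 0.01))
```

Unlike the Newton-based implicit solver, each step needs no iteration: the two linear solves reuse one matrix, which is one plausible reason for the lower cost reported for ROS-2.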

7 Model configuration
- CAM4-Chem with the TROP_MOZART mechanism
- Horizontal resolution: 0.9° x 1.25° (latitude by longitude)
- Vertical resolution: 26 layers (~3 hPa)
- Machine: Titan at Oak Ridge National Laboratory (ORNL)
- Time step size: 1800 seconds
- Number of processors: 1,536

8 Computational efficiency
Total CPU time for chemistry:

Simulation length    IMP solver    ROS-2 solver
1 month              59 hours      31 hours
1 year               686 hours     360 hours

Saved time: 47%
Speedup: 1.9
The speedup is stable for long-term simulations.
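The saved-time and speedup figures follow directly from the reported CPU hours. A short check, using only the numbers on this slide (the function names are illustrative):

```python
# Total CPU hours for chemistry as reported on the slide: (IMP, ROS-2).
cpu_hours = {"1 month": (59.0, 31.0), "1 year": (686.0, 360.0)}


def speedup(imp_hours, ros2_hours):
    """Ratio of IMP time to ROS-2 time."""
    return imp_hours / ros2_hours


def saved_fraction(imp_hours, ros2_hours):
    """Fraction of the IMP time saved by switching to ROS-2."""
    return 1.0 - ros2_hours / imp_hours


if __name__ == "__main__":
    for label, (imp, ros2) in cpu_hours.items():
        print(f"{label}: speedup {speedup(imp, ros2):.2f}, "
              f"saved {100.0 * saved_fraction(imp, ros2):.1f}%")
```

Both simulation lengths give a speedup of about 1.9 and a saving between 47% and 48%, consistent with the rounded figures on the slide.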

9 That's all?

10 Titan configuration
- Compute nodes: 18,688
- AMD cores: 299,008
- NVIDIA Tesla K20X GPU accelerators: 18,688

11 Basic Terminologies
- CUDA: framework developed by NVIDIA for programming GPUs
- Warp: the GPU issues instructions at the warp level, i.e. to 32 consecutive threads
- SIMT: Single Instruction, Multiple Threads
- Kernel: a function executed on the GPU, launched as Name<<<blocksPerGrid, threadsPerBlock>>>(parameter list)
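The two launch parameters are usually derived from the problem size. The arithmetic can be sketched as below; this is a generic illustration, not the launch configuration used in the talk, and the function names are hypothetical. The example size of 7,168 loop iterations is one that appears later in the slides, while 256 threads per block is an arbitrary choice.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def blocks_per_grid(n_items, threads_per_block):
    """Smallest number of blocks that covers n_items with one thread each."""
    return (n_items + threads_per_block - 1) // threads_per_block


def warps_per_block(threads_per_block):
    """Number of warps the hardware schedules for one block."""
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE


if __name__ == "__main__":
    n = 7168   # loop iterations mapped to threads
    tpb = 256  # assumed threads per block
    print(blocks_per_grid(n, tpb), warps_per_block(tpb))
```

When the problem size is not a multiple of the block size, the last block contains idle threads, so a kernel normally guards its body with a bounds check on the global thread index.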

12 NVIDIA K20X (Kepler)
For each GPU:
- Streaming multiprocessors (SMX): 14
- Double-precision CUDA cores: 896
- Constant memory: 64 KB
- Warp size: 32 threads
For each SMX:
- Shared memory: 64 KB
- Maximum blocks: 16
- Maximum warps: 64
- Maximum threads: 2048
- Registers: 65,536

13 Design strategies
- Extract the atmospheric chemistry module from CAM4-Chem as a box model
- Provide arbitrary input to drive the chemistry box model
- Optimize the chemistry box model on the GPU and compare it to a CPU core

14 Multiple-kernel vs. One-kernel
- Register limit reached: 256 threads per SMX, i.e. 3,584 threads per GPU
- Measured with the NVIDIA Visual Profiler
- For < 3,584 loop iterations, the one-kernel version saves ~5% of the time
- For > 3,584 loop iterations, the multiple-kernel version saves ~10% of the time
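The 3,584-thread figure is the product of 14 SMX and 256 resident threads per SMX. How a register limit could produce 256 threads per SMX can be sketched as follows. The sketch assumes each thread uses 255 registers, the architectural maximum per thread on this GPU generation; the slides do not report the actual register count, and the sketch ignores the hardware's register allocation granularity. Function names are hypothetical.

```python
SMX_PER_GPU = 14            # K20X streaming multiprocessors
REGISTERS_PER_SMX = 65_536  # 32-bit registers per SMX
WARP_SIZE = 32


def resident_threads_per_smx(registers_per_thread, max_threads=2048):
    """Threads that fit on one SMX when registers are the limiting resource."""
    by_registers = REGISTERS_PER_SMX // registers_per_thread
    whole_warps = (by_registers // WARP_SIZE) * WARP_SIZE  # warps are indivisible
    return min(whole_warps, max_threads)


def resident_threads_per_gpu(registers_per_thread):
    """Resident threads across all SMX of one GPU."""
    return SMX_PER_GPU * resident_threads_per_smx(registers_per_thread)


if __name__ == "__main__":
    regs = 255  # assumed register use per thread, not stated in the slides
    print(resident_threads_per_smx(regs), resident_threads_per_gpu(regs))
```

Under this assumption, loop counts above 3,584 no longer fit in a single wave of resident threads, which would be consistent with the crossover between the two versions at that size.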

15 Shared memory
Arrays placed in shared memory:
- Solution array (y)
- Intermediate solutions (k_1, k_2)
For 448 loop iterations, the shared-memory version saves ~26% of the time.
For > 448 loop iterations, the shared-memory version takes ~4.4x as long.

16 Constant memory
- With shared memory, the constant-memory version saves ~6.7% of the time
- Without shared memory, the constant-memory version saves ~5.2% of the time

17 CUDA streams
- Measured with the NVIDIA Visual Profiler
- Saves ~16% of the time compared to the multiple-kernel version
- Saves ~11.2% of the time compared to the one-kernel version

18 Architecture
Memory copy is slower than computation

19 Memory copy
- Contiguous memory: saves 10 to 40% of the time compared to the baseline
- Pinned memory: takes 1.1x to 2.1x the time of the baseline

20 GPU vs. CPU
Optimizations applied:
- CUDA streams
- Multiple kernels
- Constant memory
- Contiguous allocation
Results:
- The GPU is 2.33x to 11.7x faster than the CPU for computation alone
- The GPU is 1.29x to 3.82x faster than the CPU for total wall-clock time

21 Thank you

22 Number of threads per block
- Similar run times for 448 loop iterations
- ~5x difference for 7,168 loop iterations
