Running the FIM and NIM Weather Models on GPUs

Mark Govett, Tom Henderson, Jacques Middlecoff, Jim Rosinski, Paul Madden
NOAA Earth System Research Laboratory
Global Models
- 0 to 14 days, 10 to 30 km resolution, 1000 CPU cores
- Hurricane Sandy: NOAA FIM (and other) models accurately predicted the hurricane track 5 days ahead
Regional Models
- 0 to 48 hours, 1 to 3 km resolution, 1000 CPU cores
- Hurricane Sandy: the HRRR model consistently predicted gusts above 70 knots from the southeast over the New York area up to 15 hours in advance
Global Cloud Resolving Models
- 2-4 km resolution
- Minimum of 80,000 CPUs to run at ~50 percent of real time
- Need to get to 1-2 percent of real time for NWS use
- Key ingredient is massive amounts of computing: an estimated 200,000 to 300,000 CPU cores to be useful

Model Developments
- NIM: NOAA ESRL, USA
- MPAS: NCAR, USA
- NICAM: JAMSTEC, Japan
- ICON: DWD, Germany
- CubeSphere: NASA, USA
- Other efforts under early development worldwide
[Figure: icosahedral grid]
F2C-ACC Fortran GPU Compiler
- Developed in 2009, before commercial GPU compilers were available
- Used to parallelize NIM & FIM dynamics
- Focus on minimizing changes to preserve the original code
  - Single source to preserve performance portability
  - Runs on CPU and GPU, serial and parallel
- Advanced capabilities to preserve single source and improve performance
- Used to evaluate commercial compilers
  - 2011: evaluated CAPS, PGI (Henderson)
  - 2012: shared stand-alone tests with all the vendors (performance, correctness, language support)
  - Plan another evaluation in 2013
Why is single source so important?
- Models are increasingly complex
- Modeling systems take significant cost and effort to develop and maintain over their lifecycle
- Must be performance portable & interoperable
  - Support research and operational use
- Computing systems are becoming more diverse
  - Intel, AMD, IBM-Power, plus GPU, Intel MIC, and more to come
[Figure: ensembles of modeling systems (HYCOM, WRF, GFS, FIM, NIM; physics, dynamics, chemistry components) running across diverse supercomputers: NCEP Operations, NOAA @ Oak Ridge, DoE Titan 20 PFlops (2013), NOAA @ West Virginia, NOAA @ Boulder, NCAR MIC Cluster, TACC (2013)]
OpenMP / F2C-ACC GPU Kernel
- Directives are generally placed in the same locations for GPU and OpenMP parallelization
- OMP: need sufficient work to overcome thread startup overhead; data movement is implicit
- The F2C-ACC accelerator model assumes data resides on the CPU

!$OMP PARALLEL DO PRIVATE (k) SCHEDULE (runtime)
!ACC$REGION(<nvl>,<ime-ims+1>) BEGIN
!ACC$DO PARALLEL(1)
      do ipn=ips,ipe
!ACC$DO VECTOR(1)
        do k=1,nvl
          worka(k,ipn) = tr3d(k,ipn,1)
        end do
      end do
!ACC$REGION END
!$OMP END PARALLEL DO
OpenACC GPU Kernel
- Unclear how OpenACC compilers handle data movement
  - Explicitly listing each variable for each region can be tedious, particularly if the region is big and complicated
- Unclear whether OpenACC compilers support Fortran 90 array syntax
- Compiler analysis could determine parallelism implicitly

!$acc DATA [copyin] [copyout]
!$acc PARALLEL [gang] [worker] [vector]
!     worka(:,:) = tr3d(:,:,1)      ! unclear if OpenACC can handle this
!$acc LOOP [gang] [worker] [vector]
      do ipn=ips,ipe
!$acc LOOP [gang] [worker] [vector]
        do k=1,nvl
          worka(k,ipn) = tr3d(k,ipn,1)
        end do
      end do
!$acc END PARALLEL
FIM Parallelization for Fine-Grain
- Well-established code
  - Designed in 2000 for CPUs
  - Near operational status: running daily at 10, 15, and 30 km resolutions
  - Multi-faceted development: ensembles, chemistry, ocean
- Code structure
  - Fortran modules
  - Deeper call tree than NIM
- Limited ability to change the code
  - Must demonstrate a performance benefit with no degradation in code clarity; otherwise scientists must evaluate the changes
- Indexing: lat-lon models use a(k,i,j); FIM uses a single horizontal index, a(k,indx)
- Parallelism - Dynamics (see the sketch below)
  - GPU: blocking in the horizontal, threading in the vertical
  - OpenMP, MIC: threading in the horizontal, vectorization in the vertical
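To illustrate the two indexing schemes, here is a minimal Fortran sketch; the array and bound names (a, b, a1, b1, nvl, ips, ipe, jps, jpe) are placeholders, not taken from the FIM source:

      ! Illustrative sketch of lat-lon vs. FIM single-index addressing.
      subroutine indexing_sketch(a, b, a1, b1, nvl, ips, ipe, jps, jpe)
        integer, intent(in) :: nvl, ips, ipe, jps, jpe
        real, intent(in)    :: b(nvl,ips:ipe,jps:jpe), b1(nvl,ips:ipe)
        real, intent(out)   :: a(nvl,ips:ipe,jps:jpe), a1(nvl,ips:ipe)
        integer :: i, j, k, indx
        ! Lat-lon: two horizontal indices (i,j) plus a vertical index k
        do j = jps, jpe
          do i = ips, ipe
            do k = 1, nvl
              a(k,i,j) = 2.0 * b(k,i,j)
            end do
          end do
        end do
        ! FIM: one horizontal index covering all columns; the horizontal
        ! loop maps onto GPU blocks (blocking in the horizontal) while the
        ! inner k loop is threaded or vectorized in the vertical.
        do indx = ips, ipe
          do k = 1, nvl
            a1(k,indx) = 2.0 * b1(k,indx)
          end do
        end do
      end subroutine indexing_sketch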
Validation of Results
- Before CUDA 4.3, digits of accuracy were used to compare FIM / NIM results between the CPU and the GPU or MIC:

  Variable   Ndifs   RMS (1)            RMSE         max          DIGITS
  rublten    2228    0.1320490309E-03   0.2634E-09   0.3922E-05   5
  rvblten    2204    0.2001348128E-03   0.6318E-09   0.2077E-04   4
  exch_h     3316    0.1670498588E+02   0.8979E-05   0.8379E-05   5
  hpbl       9       0.4522379124E+03   0.2688E-03   0.1532E-04   4

- Small differences in one timestep can become significant when a model runs over many timesteps
- Model results are now bit-for-bit identical (Intel, MIC, NVIDIA) when the fused multiply-add (FMA) instruction is turned off
  - Eliminates truncation of arithmetic operations
  - Significantly speeds parallelization and helps detect synchronization bugs
  - Exceptions: power, log, and likely other intrinsic functions; the system versions were replaced by a library routine and used for correctness tests
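The slide does not show how the digit counts were computed; below is a minimal sketch of one plausible formulation. The subroutine name, the formula -log10(|cpu-gpu|/|cpu|), and the reporting of the worst point are all assumptions, not the actual FIM/NIM validation code:

      ! Sketch of a digits-of-accuracy check between CPU and GPU results.
      ! The relative-difference formula and the worst-case (minimum)
      ! digit count are assumptions; the real validation code may differ.
      subroutine check_digits(cpu, gpu, n, ndifs, mindigits)
        integer, intent(in)  :: n
        real,    intent(in)  :: cpu(n), gpu(n)
        integer, intent(out) :: ndifs      ! number of differing points
        integer, intent(out) :: mindigits  ! worst digits of agreement
        integer :: i
        real    :: reldiff
        ndifs = 0
        mindigits = 15
        do i = 1, n
          if (cpu(i) /= gpu(i)) then
            ndifs = ndifs + 1
            if (cpu(i) /= 0.0) then
              reldiff = abs(cpu(i) - gpu(i)) / abs(cpu(i))
              mindigits = min(mindigits, int(-log10(reldiff)))
            end if
          end if
        end do
      end subroutine check_digits

For the bit-for-bit runs, FMA can typically be disabled with compiler flags such as --fmad=false (nvcc) or -no-fma (Intel); the slide does not state which flags were used here.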
FIM Performance: Single Socket
- No changes to the FIM source code
- GPU timings used the F2C-ACC compiler
- Optimized for the Fermi GPU; further optimization for Kepler is needed
- Code changes are needed to improve hybgen performance (currently >50% of GPU runtime)

  FIM Dynamics Routine   Fermi GPU   SandyBridge CPU   Xeon Phi KNC   Kepler GPU (early results)
  trcadv                 1.28        2.10              1.09 (1.9)     0.99 (2.0)
  cnuity                 1.04        1.13              0.49 (2.3)     0.68 (1.8)
  momtum                 0.41        0.71              0.34 (2.1)     0.35 (2.1)
  hybgen                 4.13        4.09              3.10 (1.3)     3.43 (1.2)
  TOTAL                  8.01        8.87              5.86 (1.5)     6.30 (1.4)

  (All measurements are for a single socket; numbers in parentheses appear to be speedups relative to the SandyBridge CPU.)
NIM Parallelization for Fine-Grain
- Uniform, hexagonal-based, icosahedral grid
- Novel indirect addressing scheme permits concise, efficient code
- Designed for fine-grain parallelism in 2008
- Dynamics: running on GPU and CPU; serial, MPI-parallel, OpenMP; MIC soon
- Physics (GFS, YSU): OpenMP, MIC, and GPU parallelization planned
- Testing at 120, 60, and 30 km: Aqua-Planet runs to 300 days
- Testing at 120 and 60 km: real-data runs
- Indexing: lat-lon models use a(k,i,j); NIM uses a(k,ipn), with a single horizontal index ipn over all columns
- Parallelism - Dynamics (see the sketch below)
  - GPU: blocking in the horizontal, threading in the vertical
  - OpenMP, MIC: threading in the horizontal, vectorization in the vertical
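A minimal sketch of icosahedral indirect addressing via a neighbor lookup table; the names prox, nprox, nip, and nvl follow common icosahedral-grid conventions but are assumptions here, not necessarily the NIM source:

      ! Illustrative edge computation using a neighbor table:
      !   prox(isn,ipn)  = horizontal index of neighbor isn of column ipn
      !   nprox(ipn)     = number of neighbors (5 or 6 on this grid)
      subroutine edge_avg(u, flux, prox, nprox, nvl, nip)
        integer, intent(in)  :: nvl, nip
        integer, intent(in)  :: nprox(nip), prox(6,nip)
        real,    intent(in)  :: u(nvl,nip)
        real,    intent(out) :: flux(nvl,6,nip)
        integer :: ipn, isn, ipx, k
        do ipn = 1, nip                  ! single loop over all columns
          do isn = 1, nprox(ipn)         ! edge neighbors via lookup table
            ipx = prox(isn,ipn)
            do k = 1, nvl
              flux(k,isn,ipn) = 0.5 * (u(k,ipn) + u(k,ipx))
            end do
          end do
        end do
      end subroutine edge_avg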
NIM Serial Performance (2013)
- No changes to the source code
- Single-socket performance: 10K horizontal points, 96 vertical levels
- Very efficient CPU performance: measured 29% of peak (Intel Westmere)

  NIM       Opteron   Westmere   SandyBridge   Fermi   K20x
  runtime   143.0     86.8       60.0          25.0    20.7

- Parallel performance: being run on up to 160 GPUs; working on optimizing inter-GPU communications
GPU to GPU Communications
- Scalable Modeling System (SMS)
  - Distributed-memory parallelization: directive-based Fortran compiler and MPI-based runtime library
  - Used at NOAA for over two decades
  - Extended to support inter-GPU communications
- A halo exchange, !sms$exchange(a,b), proceeds in five steps:
  1. Pack data on the GPU
  2. Copy to the CPU
  3. MPI-based communications between CPUs
  4. Copy to the destination GPU
  5. Unpack on the GPU
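A minimal sketch of what the five steps could look like in code, using OpenACC update directives for the PCIe copies and MPI for step 3. The routine, buffer, and index names are illustrative; SMS generates its own pack/unpack and communication code, which is not shown on the slide, and a real exchange would use separate send and receive index lists:

      subroutine halo_exchange(a, halo_idx, nvl, nhalo, nbr, comm)
        use mpi
        integer, intent(in)    :: nvl, nhalo, nbr, comm
        integer, intent(in)    :: halo_idx(nhalo)   ! columns to exchange
        real,    intent(inout) :: a(nvl,*)
        real    :: sendbuf(nvl*nhalo), recvbuf(nvl*nhalo)
        integer :: i, k, ierr
!$acc data present(a) copyin(halo_idx) create(sendbuf, recvbuf)
! 1. Pack halo columns into a contiguous buffer on the GPU
!$acc parallel loop collapse(2)
        do i = 1, nhalo
          do k = 1, nvl
            sendbuf((i-1)*nvl+k) = a(k, halo_idx(i))
          end do
        end do
! 2. Copy the packed buffer from GPU to CPU
!$acc update host(sendbuf)
! 3. MPI-based communication between CPUs
        call mpi_sendrecv(sendbuf, nvl*nhalo, MPI_REAL, nbr, 0, &
                          recvbuf, nvl*nhalo, MPI_REAL, nbr, 0, &
                          comm, MPI_STATUS_IGNORE, ierr)
! 4. Copy the received buffer from CPU to GPU
!$acc update device(recvbuf)
! 5. Unpack into halo locations on the GPU
!$acc parallel loop collapse(2)
        do i = 1, nhalo
          do k = 1, nvl
            a(k, halo_idx(i)) = recvbuf((i-1)*nvl+k)
          end do
        end do
!$acc end data
      end subroutine halo_exchange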
NIM Parallel Performance: Weak Scaling
- 4096 columns / GPU, 96 vertical levels, 7000 timesteps (~10 days), Kepler K20x
- Initial results: over 50% of the runtime was spent doing GPU-to-GPU communications

  Kepler GPUs   GPU-to-GPU Communications   Total Time (sec)
  10            1059 (57%)                  1859
  40            1071 (57%)                  1878
  160           1082 (57%)                  1890

  Communication breakdown at 160 GPUs (seconds):
  Initialization        3
  Pack Data on GPU      861
  CPU-GPU Copy          59
  MPI Communications    82
  UnPack on GPU         77
  Total                 1082
NIM Parallel Performance: Weak Scaling with Communications Optimization
- Moved the collective operation to the CPU instead of doing it on the GPU with mapped memory
  - Mapped memory generated too many small writes across the PCIe bus
  - Resulted in a 5-17x speedup for the unpack operation

  GPUs   GPU-to-GPU Comm Time   Total Time (sec)
  10     232 (22%)              1034
  40     247 (23%)              1054
  160    266 (24%)              1076

  Communication breakdown at 160 GPUs (seconds):
  Initialization        3  ( 1%)
  Pack Data on GPU      45 (17%)
  CPU-GPU Copy          59 (22%)
  MPI Communications    82 (31%)
  UnPack on GPU         77 (29%)
  Total                 266
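A sketch of the change for the unpack step. This is an interpretation of the slide, not the actual SMS code, and the names a, gpu_buf, and perm are placeholders: before, the GPU kernel scattered elements directly into host-mapped (zero-copy) memory, so each assignment crossed the PCIe bus as a small write; after, the packed buffer crosses PCIe once and the CPU performs the scatter locally:

      subroutine unpack_on_cpu(a, gpu_buf, perm, n)
        integer, intent(in)    :: n, perm(n)
        real,    intent(in)    :: gpu_buf(n)
        real,    intent(inout) :: a(*)
        integer :: i
        ! Before (slow, shown as a comment): scatter on the GPU into
        ! host-mapped memory, one tiny PCIe transaction per element:
        !   a_mapped(perm(i)) = gpu_buf(i)
!$acc update host(gpu_buf)          ! after: one bulk PCIe transfer
        do i = 1, n
          a(perm(i)) = gpu_buf(i)   ! ordinary local CPU memory writes
        end do
      end subroutine unpack_on_cpu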
3.5 km NIM on Titan in 2013
- Dynamics on the GPU, physics on the CPU + OpenMP
- Per-timestep flow: dynamics (GPU) -> GPU-to-CPU transfer -> physics (CPU) -> CPU-to-GPU transfer -> dynamics (see the sketch below)
- 10-day forecast, ~10,000 horizontal points / GPU

  Resolution (km)   Vertical Levels   GPUs   Dynamics   CPU-GPU Transfer   Physics   Total Time (hours)
  30                96                40     1700       400                1900      1.1
  15                96                160    1700       400                1900      2.2
  7.5               96                640    1700       400                1900      4.4 (7.2%)
  3.75              96                2560   1700       400                1900      8.8 (3.6%)
  3.75              192               5120   1700       400                1900      8.8 (3.6%)
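A schematic of the per-timestep flow; run_dynamics, run_physics, state, and ntimesteps are placeholder names, not NIM routines or variables:

      ! Illustrative timestep loop: dynamics on the GPU, physics on the
      ! CPU, with explicit transfers between them each step.
      do itime = 1, ntimesteps
        call run_dynamics(state)    ! dynamics executes on the GPU
!$acc update host(state)            ! GPU -> CPU transfer
        call run_physics(state)     ! physics on the CPU (+ OpenMP)
!$acc update device(state)          ! CPU -> GPU transfer
      end do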