Optimizing Weather Model Radiative Transfer Physics for the Many Integrated Core and GPGPU Architectures

John Michalakes
NOAA/NCEP/Environmental Modeling Center (IM Systems Group)
University of Colorado at Boulder

Mike Iacono, David Berthiaume
AER

Heterogeneous Multi-core workshop, Boulder, 17 Sep 2014
NOAA/NWS/Environmental Modeling Center

Rapid Radiative Transfer Model (RRTMG*)

- Accurate calculation of fluxes and cooling rates from incoming (shortwave) and outgoing (longwave) radiation
- Significant computational cost
- Load imbalance (day/night and cloud fraction)
- Coded as 1-D vertical columns, but this dimension does not vectorize
- Used in many weather and climate models
  - NCAR WRF
  - NCAR CAM5 and CESM1
  - NASA GEOS-5
  - NOAA NCEP GFS, CFS, RUC
  - ECMWF IFS and ERA40
  - ECHAM5

[Figure: one column of a weather or climate model domain, labeled "No Vector!"]

(*Iacono et al., JGR, 2008; Mlawer et al., JGR, 1997)

Performance results: RRTMG Kernel on Xeon Phi and host Xeon (SNB)

- Workload
  - 1 node of 80-node NMMB run
  - 4 km CONUS domain
  - 1 RRTMG invocation
  - 18819 columns, 60 levels
  - 46.5 billion DP floating point ops
- Code restructuring
  - Increase concurrency
  - Increase vectorization
  - Decrease memory system pressure
  - Performance improves on host too
- Whole code improvement
  - 1.5x in radiation kernel
  - 1.7x overall code improvement; other parts benefit too (decreased effect of load imbalance?)
  - Note: importance of -fno-alias
  - Dual Sandy Bridge node on NOAA WCOSS system

Restructuring RRTMG in NMM-B

- Concurrency and locality
  - Original RRTMG called in OpenMP threaded loop over South-North dimension
  - Rewrite loop to iterate over tiles in two dimensions
  - Dynamic thread scheduling
- Vectorization
  - Originally vertical pencils
  - Extend inner dimension of lowest-level tiles to width of SIMD unit on KNC
  - Static definition of VECLEN

[Figures: tiling along the west-east dimension; call tree]
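The tiling described on this slide can be sketched roughly as follows. This is an illustration only, not the NMM-B code: the function names, the domain sizes, and the value VECLEN = 8 (taken here as a 512-bit SIMD unit divided by 64-bit doubles) are assumptions, and a Python thread pool stands in for OpenMP dynamic scheduling, since idle workers simply take the next tile from the queue.

```python
from concurrent.futures import ThreadPoolExecutor

VECLEN = 8  # assumption: 512-bit SIMD unit / 64-bit doubles; fixed statically


def make_tiles(nx, ny, veclen=VECLEN):
    """Cut an nx-by-ny horizontal domain into tiles that are at most
    veclen columns wide in west-east and one row deep in south-north."""
    tiles = []
    for j in range(ny):                  # south-north
        for i0 in range(0, nx, veclen):  # west-east, in SIMD-width chunks
            tiles.append((j, i0, min(i0 + veclen, nx)))
    return tiles


def process_tile(tile):
    """Hypothetical stand-in for the radiation call on one tile.
    Returns the (i, j) columns the tile covers."""
    j, i0, i1 = tile
    return [(i, j) for i in range(i0, i1)]


def run(nx, ny, nthreads=4):
    """Process all tiles; workers pick up the next tile as they become free."""
    tiles = make_tiles(nx, ny)
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        results = list(pool.map(process_tile, tiles))
    return tiles, results
```

Calling `run(20, 3)` splits each 20-column row into chunks of 8, 8 and 4 columns. Handing out small tiles on demand is one way to soften the day/night and cloud-fraction load imbalance noted earlier, because a thread that draws cheap tiles simply processes more of them.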

Other transformations

- Array index reordering
  - ABSA, ABSB lookup tables in LWRAD
  - First index is indirect and effectively random access
  - 2nd index, over spectral interval, is accessed sequentially
  - Inverting these enables vectorization over spectral intervals
- Compute instead of lookup tables
  - In-place computation of trans in RRTM longwave saves about 3% overall on MIC and 3.5% on Xeon for the test workload
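A small sketch of why inverting the two indices helps, assuming Fortran-style column-major storage, where the first index is the contiguous one. The table sizes and helper names below are hypothetical; only the index roles (indirect index versus spectral interval) come from the slide.

```python
N_IND, N_G = 5, 4  # hypothetical sizes: indirect index, spectral intervals


def offset_colmajor(i, j, n_first):
    """Flat offset of element (i, j) in a column-major 2-D array
    whose first dimension has extent n_first."""
    return i + n_first * j


def offsets_over_spectral(ind):
    """Memory offsets touched when looping over the spectral index ig
    for one fixed indirect index ind, before and after reordering."""
    original = [offset_colmajor(ind, ig, N_IND) for ig in range(N_G)]
    reordered = [offset_colmajor(ig, ind, N_G) for ig in range(N_G)]
    return original, reordered


def reorder(table):
    """Rewrite a flat table stored as (ind, ig) into (ig, ind) order."""
    out = [None] * (N_IND * N_G)
    for ind in range(N_IND):
        for ig in range(N_G):
            out[offset_colmajor(ig, ind, N_G)] = table[offset_colmajor(ind, ig, N_IND)]
    return out
```

For `ind = 3`, the original layout touches offsets 3, 8, 13, 18 (stride N_IND), while the reordered layout touches 12, 13, 14, 15 (unit stride). Unit-stride access over the spectral interval is what lets a compiler emit contiguous vector loads.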

Other transformations

- Task interleaving
  - Original: pure SPMD implementation over threads
    - Each of four threads on a core calls shortwave then longwave
    - Threads hit the high-pressure sections of the code in unison
  - But longwave and shortwave computations are independent
  - Modified: half the threads on each core reverse the order, longwave then shortwave
    - Helps even out the spikes in resource pressure on each core
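One way to express the modified ordering is a function of the thread number, sketched below. It assumes threads are numbered compactly, so that consecutive ids share a core, and it arbitrarily picks the upper half of each core's threads as the ones that reverse; neither detail is stated on the slide, and `call_order` is a hypothetical name.

```python
THREADS_PER_CORE = 4  # four hardware threads per core, as on the slide


def call_order(thread_id, threads_per_core=THREADS_PER_CORE):
    """Return the order in which a thread runs the two independent kernels.

    Assumes compact numbering: ids 0..3 on core 0, 4..7 on core 1, and so on.
    The lower half of each core's threads keeps the original order; the
    upper half reverses it (an arbitrary choice for this sketch).
    """
    slot = thread_id % threads_per_core
    if slot < threads_per_core // 2:
        return ("shortwave", "longwave")
    return ("longwave", "shortwave")
```

With this assignment, at any moment roughly half the threads on a core are in shortwave and half in longwave, instead of all four reaching the same high-pressure section together.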

Effect of Optimizations on RRTMG Kernel

- Improvement
  - 2.8x overall
  - 5.3x in SWRAD
  - 0.75x in LWRAD (degraded)
- Increasing chunk size results in
  - 2.5x increase in working set size, from 407 KB to 1034 KB per thread
  - 4x increase in L2 misses, which task interleaving reduced by 30% in SWRAD
- Memory traffic
  - Increased from 59 to 124 GB/s, still short of saturation
- Key bottleneck is memory latency
  - Hyperthreading effective only to 2 threads
  - Software prefetching helps up to 3 threads

Michalakes, Iacono, Jessup. Optimizing Weather Model Radiative Transfer Physics for Intel's Many Integrated Core (MIC) Architecture. Preprint.
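The ratios quoted on this slide can be checked with a few lines of arithmetic. The sketch below also compares the per-thread working set with a per-core L2 capacity of 512 KB; that capacity is an assumption brought in for illustration and does not appear on the slide.

```python
L2_PER_CORE_KB = 512  # assumption: KNC L2 capacity per core, not from the slide


def growth(before, after):
    """Factor by which a quantity grew."""
    return after / before


def threads_fitting_in_l2(working_set_kb, l2_kb=L2_PER_CORE_KB):
    """How many whole per-thread working sets fit in one core's L2."""
    return int(l2_kb // working_set_kb)


ws_growth = growth(407, 1034)     # working set per thread, KB
traffic_growth = growth(59, 124)  # memory traffic, GB/s
```

The working set grows by a factor of about 2.54 and memory traffic by about 2.10. Under the assumed 512 KB L2, one 407 KB working set fits in a core's cache but a 1034 KB working set does not, which is at least consistent with the reported rise in L2 misses and the conclusion that latency, not bandwidth, is the limit.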

Comparison to GPU Performance

- AER development of RRTMGPU*
  - Originally funded by NASA for GEOS-5
  - DOE Climate Modeling SciDAC Program funding application to WRF
  - RRTMGPU_LW and SW implemented in WRF_v3.51 and testing in progress on NCAR Caldera
- Apples-to-apples comparisons
  - Ported GPU version of AER's shortwave code to MIC
    - Converted OpenACC threading directives to OpenMP
    - Permuted loop ordering to favor vectorization on MIC or threading on GPU
  - Neither code hyper-optimized
    - Used fast math on both platforms; otherwise no arcane validity- or stability-degrading compiler options
    - No coding to metal (no CUDA or vector intrinsics)
  - GPU timings include PCI transfer overhead; KNC and Xeon timings assume native or symmetric execution

*Different code from RRTMG in NMMB

Comparison to GPU Performance

[Figure: performance comparison chart; recoverable labels: Sandy Bridge, MIC Knights Corner, Ivy Bridge, Haswell]

Summary

- Restructuring to improve concurrency and vectorization for MIC also improved performance on the host multi-core processor
- Current MIC and GPU only just hold their own relative to Xeon
  - Similar story with other weather model physics (e.g. WSM5 work)
- Reliable predictive performance model is elusive: VTune-reported metrics, e.g. Latency Impact and vector utilization, often do not agree with observed performance
- Large working set sizes in NWP physics are problematic
  - Latency bound: spills out of cache (MIC) or local stores (GPU)
  - Restructuring for vector (MIC) or threading (GPU) makes it worse
- KNL outlook
  - Hostless bootable KNL nodes will eliminate need for offload
  - High-bandwidth on-package memory won't help if we're latency bound
  - Cache per core isn't likely to increase
  - Will KNL have better latency hiding in other ways?
