Experiences Programming and Optimizing for Knights Landing

1 Experiences Programming and Optimizing for Knights Landing John Michalakes, University Corporation for Atmospheric Research, Boulder, Colorado USA. 2016 MultiCore 6 Workshop, September 13th, 2016. (Post-workshop: slides 26 and 36 updated on 2016/09/22)

2 Outline Hardware and software; Applications: NEPTUNE, MPAS, WRF and kernels; Knights Landing overview; Optimizing for KNL; Scaling; Performance; Roofline; Tools; Summary 2

3 Models: NEPTUNE/NUMA Andreas Mueller, NPS, Monterey, CA; and M. Kopera, S. Marras, and F. X. Giraldo. Towards Operational Weather Prediction at 3.0km Global Resolution With the Dynamical Core NUMA. 96th Amer. Met. Society Annual Mtg., January 2016. 3

4 Models: NEPTUNE/NUMA Spectral element, 4th, 5th and higher-order* continuous Galerkin (discontinuous planned). Cubed sphere (also icosahedral). Adaptive Mesh Refinement (AMR) in development. NEPTUNE 72 h forecast (5 km resolution) of accumulated precipitation for Hurricane Sandy. Computationally dense but highly scalable: constant width-one halo communication, good locality for next generation HPC. MPI-only, but OpenMP threading in development. Example of adaptive grid tracking a severe event (courtesy: Frank Giraldo, NPS). * This is not the same order as is used to identify the leading term of the error in finite difference schemes, which in fact describes accuracy. Evaluation of Gaussian quadrature over N+1 LGL quadrature points will be exact to machine precision as long as the polynomial integrand is of order 2(N+1)-3 or less. Gabersek et al., MWR, Apr 2012. 4

5 Models: NEPTUNE/NUMA NWP is one of the first HPC applications and, with climate, has been one of its key drivers. Exponential growth in HPC capability has translated directly to better forecasts and steadily improving value to the public. HPC growth continues toward Peta-/Exaflop, but only parallelism is increasing: more floating point capability, proportionately less cache, memory and I/O bandwidth. NWP parallelism scales in one dimension less than complexity. Ensembles scale computationally but move the problem to I/O. Can operational NWP stay on the bus? More scalable formulations; expose and exploit all available parallelism, especially fine-grain. (Figure shows the computational resources needed to maintain fixed solution time with increasing resolution, up to 3,000,000 threads on 47,000 nodes.) Andreas Mueller, NPS, Monterey, CA; and M. Kopera, S. Marras, and F. X. Giraldo. Towards Operational Weather Prediction at 3.0km Global Resolution With the Dynamical Core NUMA. 96th Amer. Met. Society Annual Mtg., January 2016. 5

6 Models: NEPTUNE/NUMA See also: Andreas Müller, Michal A. Kopera, Simone Marras, Lucas C. Wilcox, Tobin Isaac, Francis X. Giraldo. Strong Scaling for Numerical Weather Prediction at Petascale with the Atmospheric Model NUMA (last revised 8 Sep 2016, v3). NWP is one of the first HPC applications and, with climate, has been one of its key drivers. Exponential growth in HPC capability has translated directly to better forecasts and steadily improving value to the public. HPC growth continues toward Peta-/Exaflop, but only parallelism is increasing: more floating point capability, proportionately less cache, memory and I/O bandwidth. NWP parallelism scales in one dimension less than complexity; ensembles scale computationally but move the problem to I/O. Can operational NWP stay on the bus? More scalable formulations; expose and exploit all available parallelism, especially fine-grain. (Figure shows the computational resources needed to maintain fixed solution time with increasing resolution, up to 3,000,000 threads on 47,000 nodes.) Andreas Mueller, NPS, Monterey, CA; and M. Kopera, S. Marras, and F. X. Giraldo. Towards Operational Weather Prediction at 3.0km Global Resolution With the Dynamical Core NUMA. 96th Amer. Met. Society Annual Mtg., January 2016. 6

7 Models: MPAS Collaborative project between NCAR and LANL for developing atmosphere, ocean and other earth system simulation components for use in climate, regional climate and weather studies. Applications include global NWP and global atmospheric chemistry, regional climate, tropical cyclones, and convection-permitting hazardous weather forecasting. Finite volume, C grid. Refinement capability: centroidal Voronoi tessellated unstructured mesh allows arbitrary in-place horizontal mesh refinement. HPC readiness: current release (v4.0) supports parallelism via MPI (horizontal domain decomposition); hybrid (MPI+OpenMP) parallelism implemented, undergoing testing. Primary funding: National Science Foundation and the DOE Office of Science 7

8 Hardware Intel Xeon Phi 7250 (Knights Landing), announced at ISC'16 in June. 14 nanometer feature size, > 8 billion transistors. 68 cores, 1.4 GHz, modified Silvermont with out-of-order instruction execution. Two 512-bit wide Vector Processing Units per core. Peak ~3 TF/s double precision, ~6 TF/s single precision. 16 GB MCDRAM (on-chip) memory, > 400 GB/s bandwidth. Hostless: no separate host processor and no offload programming. Binary compatible ISA (instruction set architecture) 8

9 Optimizing for Intel Xeon Phi Most work in MIC programming involves optimization to achieve the best share of peak performance. Parallelism: KNL has up to 288 hardware threads (4 per core) and a total of more than 2000 floating point units on the chip; expose coarse, medium and especially fine-grain (vector) parallelism in the application to use the Xeon Phi efficiently. Locality: reorganize and restructure data structures and computation to improve data reuse in cache and reduce the time floating point units sit idle waiting for data (memory latency); kernels that do not have high data reuse, or that do not fit in cache, require high memory bandwidth. The combination of these factors affecting performance on the Xeon Phi (or any contemporary processor) can be characterized in terms of computational intensity, and conceptualized using the Roofline Model. 9
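
As an illustration of the different grains of parallelism, a minimal sketch (not NEPTUNE or MPAS source; the arrays q and rhs, the bounds ncols/nlev/nvec, and the time step dt are hypothetical):

      !$omp parallel do private(k, i)     ! coarse grain: one chunk of columns per thread
      do j = 1, ncols
        do k = 1, nlev                    ! medium grain: levels reused while resident in cache
          !$omp simd                      ! fine grain: 8 DP (16 SP) lanes per 512-bit VPU
          do i = 1, nvec                  ! innermost index is stride-1 for unit-stride vector loads
            q(i,k,j) = q(i,k,j) + dt*rhs(i,k,j)
          end do
        end do
      end do
      !$omp end parallel do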

10 Optimizing for Intel Xeon Phi Roofline Model of Processor Performance Bounds application performance as a function of computational intensity*: the number of floating point operations per byte moved between the processor and a level of the memory hierarchy. (KNL roofline measured with the Empirical Roofline Toolkit; chart axes are achievable performance vs. computational intensity. Thanks: Doug Doerfler, LBNL.) Williams, S. W., A. Waterman, D. Patterson. Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS, October 17, 2008. *Sometimes referred to as arithmetic intensity (registers <-> L1: largely algorithmic) or operational intensity (LLC <-> DRAM: improvable).
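
The bound itself can be written out; a hedged worked example (the triad loop is generic, not one of the kernels measured in this talk):

      attainable GFLOP/s ~= min( peak GFLOP/s, computational intensity x memory bandwidth )

      For a double precision triad a(i) = b(i) + s*c(i): 2 flops per ~24 bytes moved (read b and c,
      write a; more if write-allocate traffic is counted), so intensity ~= 0.08 flops/byte. From
      MCDRAM at ~400 GB/s that caps the loop near 0.08 x 400 ~= 33 GFLOP/s, far below the
      2333 GFLOP/s compute ceiling shown on the following slides.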

11 Optimizing for Intel Xeon Phi Roofline Model of Processor Performance Bounds application performance as a function of computational intensity. If intensity is high enough, the application is compute bound by floating point capability: 2333 GFLOP/s double precision with vectors and FMA on KNL. Williams, S. W., A. Waterman, D. Patterson. Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS, October 17, 2008. (Vectorization means the processor can perform 8 double precision operations at once as long as the operations are independent.)
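
A minimal sketch of that independence requirement (the arrays x, y and scalar a are hypothetical): the first loop has independent iterations and can fill the vector lanes; the second carries a dependence and cannot be vectorized as written.

      ! Independent iterations: each element can be computed in a separate vector lane.
      do i = 1, n
        y(i) = a*x(i) + y(i)
      end do

      ! Loop-carried dependence: y(i) needs y(i-1) from the previous iteration,
      ! so the compiler must execute this loop serially (scalar).
      do i = 2, n
        y(i) = a*x(i) + y(i-1)
      end do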

12 Optimizing for Intel Xeon Phi Roofline Model of Processor Performance Bounds application performance as a function of computational intensity. If intensity is high enough, the application is compute bound by floating point capability: 2333 GFLOP/s double precision with vectors and FMA; 1166 GFLOP/s double precision with vectors, no FMA. Williams, S. W., A. Waterman, D. Patterson. Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS, October 17, 2008. (FMA instructions perform a floating point multiply and a floating point addition as a single operation when the compiler can detect expressions of the form result = A*B+C.)
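
A hedged sketch of an FMA-friendly statement (array names hypothetical); with the Intel compiler and a KNL architecture flag such as -xMIC-AVX512, each element of a loop in this form can map to a single vectorized fused multiply-add:

      do i = 1, n
        d(i) = a(i)*b(i) + c(i)    ! multiply and add fuse into one FMA per element
      end do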

13 Optimizing for Intel Xeon Phi Roofline Model of Processor Performance Bounds application performance as a function of computational intensity. If intensity is high enough, the application is compute bound by floating point capability: 2333 GFLOP/s double precision with vectors and FMA; 1166 GFLOP/s double precision with vectors, no FMA; 145 GFLOP/s double precision with neither vectors nor FMA. Williams, S. W., A. Waterman, D. Patterson. Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS, October 17, 2008.

14 Optimizing for Intel Xeon Phi Roofline Model of Processor Performance Bounds application performance as a function of computational intensity. If intensity is high enough, the application is compute bound by floating point capability. If intensity is not enough to satisfy the demand for data by the processor's floating point units, the application is memory bound: 128 GB main memory (DRAM). Williams, S. W., A. Waterman, D. Patterson. Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS, October 17, 2008.

15 Optimizing for Intel Xeon Phi Roofline Model of Processor Performance Bounds application performance as a function of computational intensity. If intensity is high enough, the application is compute bound by floating point capability. If intensity is not enough to satisfy the demand for data by the processor's floating point units, the application is memory bound: 128 GB main memory (DRAM); 16 GB high bandwidth memory (MCDRAM). Williams, S. W., A. Waterman, D. Patterson. Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS, October 17, 2008. Among the ways to use MCDRAM: configure KNL to use MCDRAM as a direct mapped cache, or configure KNL to use MCDRAM as a NUMA region: numactl --membind=0 ./executable (main memory); numactl --membind=1 ./executable (MCDRAM).
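
In flat/NUMA mode, individual arrays can also be placed in MCDRAM from Fortran; a minimal sketch assuming the Intel compiler's FASTMEM attribute and the memkind library are available (the array name state and its bounds are hypothetical):

      real(8), allocatable :: state(:,:,:)
      !dir$ attributes fastmem :: state        ! request allocation from MCDRAM (flat mode)
      allocate(state(nvec,nlev,ncols))         ! behavior when MCDRAM is exhausted depends on the memkind policy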

16 Optimizing for Intel Xeon Phi Roofline Model of Processor Performance Bounds application performance as a function of computational intensity. If intensity is high enough, the application is compute bound by floating point capability. If intensity is not enough to satisfy the demand for data by the processor's floating point units, the application is memory bound: 128 GB main memory (DRAM); 16 GB high bandwidth memory (MCDRAM). KNL is nominally 3 TFLOP/s, but to saturate full floating point capability you need: 0.35 flops per byte from L1 cache; 1 flop per byte from L2 cache; 6 flops per byte from high bandwidth memory; 25 flops per byte from main memory. Williams, S. W., A. Waterman, D. Patterson. Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS, October 17, 2008.
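
These break-even intensities are roughly the compute ceiling divided by the bandwidth of each memory level; a hedged back-of-the-envelope check using the numbers on these slides (the ~90 GB/s DRAM bandwidth is an assumption, not stated here):

      2333 GFLOP/s / ~400 GB/s MCDRAM  ~=  6 flops per byte
      2333 GFLOP/s / ~90 GB/s DRAM     ~= 25 flops per byte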

17 Optimizing for Intel Xeon Phi Roofline Model of Processor Performance Bounds application performance as a function of computational intensity. If intensity is high enough, the application is compute bound by floating point capability. If intensity is not enough to satisfy the demand for data by the processor's floating point units, the application is memory bound: 128 GB main memory (DRAM); 16 GB high bandwidth memory (MCDRAM). KNL is nominally 3 TFLOP/s, but to saturate full floating point capability you need: 0.35 flops per byte from L1 cache; 1 flop per byte from L2 cache; 6 flops per byte from high bandwidth memory; 25 flops per byte from main memory. Hard to come by in real applications! NEPTUNE benefits from MCDRAM (it breaks through the DRAM ceiling) but realizes only a fraction of the MCDRAM ceiling: 36.5 GF/s from MCDRAM vs. 20.3 GF/s from DRAM for NEPTUNE E14P3L40 (U.S. Navy Global Model Prototype).

18 Optimizing for Intel Xeon Phi Roofline Model of Processor Performance Bounds application performance as a function of computational intensity. If intensity is high enough, the application is compute bound by floating point capability. If intensity is not enough to satisfy the demand for data by the processor's floating point units, the application is memory bound: 128 GB main memory (DRAM); 16 GB high bandwidth memory (MCDRAM). KNL is nominally 3 TFLOP/s, but to saturate full floating point capability you need: 0.35 flops per byte from L1 cache; 1 flop per byte from L2 cache; 6 flops per byte from high bandwidth memory; 25 flops per byte from main memory. Hard to come by in real applications! Optimizing for KNL involves increasing parallelism and locality. Increasing parallelism and parallel efficiency will raise performance for a given computational intensity; increasing locality will improve computational intensity and move the bar to the right.
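
One common locality transformation is fusing sweeps that touch the same data; a minimal sketch (the arrays q, tend1, tend2 and the step dt are hypothetical) that cuts the traffic on q in half and so raises the operational intensity of this fragment:

      ! Before: two separate sweeps stream q through the memory hierarchy twice.
      do i = 1, n
        q(i) = q(i) + dt*tend1(i)
      end do
      do i = 1, n
        q(i) = q(i) + dt*tend2(i)
      end do

      ! After: one fused sweep reads and writes q only once.
      do i = 1, n
        q(i) = q(i) + dt*(tend1(i) + tend2(i))
      end do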

19 NEPTUNE Performance on 1 Node Knights Landing NEPTUNE GFLOP/second, based on a count of 9.13 billion double precision floating point operations per time step. MPI-only. Higher is better. (Chart annotations: KNL vs. KNC 4.2x; MCDRAM vs. DRAM 2.1x; KNL vs. HSW 1.3x; KNL vs. BDW 0.9x.) 19

20 NEPTUNE Performance on 1 Node Knights Landing KNL peaks at 2 threads per core: thread competition for resources (the Q index is innermost in NEPTUNE), better ILP, running out of work? MPI-only. NEPTUNE GFLOP/second, based on a count of 9.13 billion double precision floating point operations per time step. Higher is better. (Chart annotations: KNL vs. KNC 4.2x; MCDRAM vs. DRAM 2.1x; KNL vs. HSW 1.3x; KNL vs. BDW 0.9x.) 20

21 MPAS Performance on 1 Node Knights Landing MPAS GFLOP/second, based on a count of 3.7 billion single precision floating point operations per dynamics step. Higher is better. MPI and OpenMP, but the best time shown here was single-threaded. KNL 7250 B0, 68 cores (MCDRAM). Caveat: plot shows floating point rate, not forecast speed. (Chart x-axis: number of threads.) 21

22 MPAS Performance on 1 Node Knights Landing MPAS GFLOP/second, based on a count of 3.7 billion single precision floating point operations per dynamics step. Higher is better. MPI and OpenMP, but the best time shown here was single-threaded. KNL 7250 B0, 68 cores (MCDRAM). (Chart annotations: arithmetic intensity; x-axis: number of threads.) 22

23 MPAS and NEPTUNE Performance on Knights Landing MPAS GFLOP/second, based on a count of 3.7 billion single precision floating point operations per dynamics step. NEPTUNE GFLOP/second, based on a count of 9.3 billion double precision floating point operations per dynamics step. Higher is better. Caveat: plot shows floating point rate, not forecast speed. (Chart x-axis: number of threads.) 23

24 Optimizing NEPTUNE NUMA results on Blue Gene/Q (0.7 FLOP/byte) suggest opportunities for improvement from vectorization and reduced parallel overheads. Work in progress with NRL's NEPTUNE model using the Supercell configuration and workload from NGGPS testing: analyzing and restructuring based on vector reports from the compiler; adding multi-threading (OpenMP). Andreas Müller, Michal A. Kopera, Simone Marras, Lucas C. Wilcox, Tobin Isaac, Francis X. Giraldo. Strong Scaling for Numerical Weather Prediction at Petascale with the Atmospheric Model NUMA (submitted 5 Nov 2015 (v1), last revised 8 Sep 2016 (v3)).

25 Optimizing NEPTUNE (chart; higher is better) 25

26 Optimizing NEPTUNE (chart; higher is better) 26

27 Optimizing NEPTUNE Optimizations (see the sketch below):
 - Compile time specification of important loop ranges (number of variables per grid point; dimensions of an element)
 - Flattening nested loops to collapse for vectorization
 - Facilitating inlining (subroutines called from element loops in CREATE_LAPLACIAN and CREATE_RHS; matrix multiplies, need to try MKL DGEMM)
 - Splitting apart loops where part vectorizes and part doesn't
 - Fixing unaligned data (apparently still a benefit on KNL)
 - Drag forces
 - Using double precision, -fp-model precise
 - Need a larger workload
 27
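
A hedged sketch of two of these optimizations (the names ngl, nvar, nelem, rhs and src are hypothetical, not NEPTUNE source): make the short element dimensions compile-time parameters and flatten the i,j,k loops over an element into a single unit-stride loop the compiler can vectorize.

      integer, parameter :: ngl  = 4              ! points per element edge, fixed at compile time
      integer, parameter :: npts = ngl*ngl*ngl    ! flattened element size
      integer, parameter :: nvar = 5              ! variables per grid point

      do ie = 1, nelem
        do m = 1, nvar
          !$omp simd
          do ip = 1, npts                         ! single flattened, stride-1 loop over the element
            rhs(ip,m,ie) = rhs(ip,m,ie) + src(ip,m,ie)
          end do
        end do
      end do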

28 Optimizing NEPTUNE (chart; higher is better) 28

29 Optimizing NEPTUNE SPMD OpenMP**: create the threaded region very high in the call tree; improves locality, behaves more like MPI tasks; avoid starting and stopping threaded regions; avoid threading over different domain dimensions (HOMME uses this to block over elements, tracers and the vertical dimension). MPI-3 shared memory windows*: resyncing the memory model ate up any benefits; Intel MPI is hard to beat on Phi (using a shmem-like transport already?). ** Jacob Weismann Poulsen (DMI), Per Berg (DMI), Karthik Raman (Intel). HBM code modernization insights from a Xeon Phi experiment, CUG 2016, London. See also Torsten Hoefler, ETH Zurich; Brinskiy and Lubin: content/uploads/2015/06/PUM21 2 An_Introduction_to_MPI 3.pdf. * Jed Brown, To Thread or Not To Thread, NCAR Multi-core Workshop, Threads.pdf. 29
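
A minimal sketch of the SPMD OpenMP structure described above (illustrative only; omp_get_thread_num and omp_get_num_threads are standard OpenMP, while my_element_range, compute_rhs and update_solution are hypothetical placeholders):

      use omp_lib
      ...
      !$omp parallel private(tid, ie_start, ie_end, istep)
      tid = omp_get_thread_num()
      ! Each thread keeps the same element range for the whole run, like an MPI rank,
      ! which preserves first-touch locality and avoids region start/stop overhead.
      call my_element_range(tid, omp_get_num_threads(), nelem, ie_start, ie_end)
      do istep = 1, ntimesteps
        call compute_rhs(ie_start, ie_end)
        !$omp barrier                     ! synchronize only where the numerics require it
        call update_solution(ie_start, ie_end)
        !$omp barrier
      end do
      !$omp end parallel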

30 Optimizing NEPTUNE (chart; higher is better) 30

31 NUMA effects on MCDRAM? Sub-NUMA Clustering (SNC-4) divides the node into four NUMA regions, each assigned two MCDRAM memory controllers (EDCs). Can first touch by MPI task 0 during the application's initialization phase cause uneven placement of addresses? Measure traffic as the number of cache lines read/written, using uncore event counters on each EDC. *Available from Intel and coming in RHEL/CENTOS
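
One standard remedy (a hedged sketch, not the NEPTUNE code; the array q and its bounds are hypothetical) is to first-touch each array from the same thread decomposition that later computes on it, so pages are faulted into the NUMA region whose threads, and EDCs, will use them:

      real(8), allocatable :: q(:,:,:)
      allocate(q(nvec,nlev,ncols))
      !$omp parallel do schedule(static)   ! same static decomposition as the compute loops
      do j = 1, ncols
        q(:,:,j) = 0.0d0                   ! first touch maps these pages near the owning thread
      end do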

32 NUMA effects on MCDRAM? Sub-NUMA Clustering (SNC-4) divides the node into four NUMA regions, each assigned two MCDRAM memory controllers (EDCs). Can first touch by MPI task 0 during the application's initialization phase cause uneven placement of addresses? Measure traffic as the number of cache lines read/written, using uncore event counters on each EDC. Linux kernel extensions to support counting the KNL uncore in the Perf utility*; functionality being added to PAPI; application level library courtesy of Sumedh Naik, Intel:

      program NEPTUNE
        call ucore_setup
        ...
        if (irank == irank0) then
          call ucore_start_counting
        endif
      !$omp parallel firstprivate(...
        do 1, ntimesteps
          compute a time step
        enddo
      !$omp end parallel
        if (irank == irank0) then
          call ucore_stop_counting
          call ucore_tellall
        endif
        ...

*Available from Intel and coming in RHEL/CENTOS

33 NUMA effects on MCDRAM? Sub-NUMA Clustering (SNC-4) divides the node into four NUMA regions, each assigned two MCDRAM memory controllers (EDCs). Can first touch by MPI task 0 during the application's initialization phase cause uneven placement of addresses? Measure traffic as the number of cache lines read/written, using uncore event counters on each EDC. Linux kernel extensions to support counting the KNL uncore in the Perf utility*; functionality being added to PAPI; application level ucore library courtesy of Sumedh Naik, Intel.

OUTPUT FROM SUMEDH NAIK'S UCORE EVENT LIBRARY:
      DDR Reads:      DDR Reads Total:
      DDR Writes:     DDR Writes Total:
      MCDRAM Reads:   MCDRAM Reads Total:
      MCDRAM Writes:  MCDRAM Writes Total:

*Available from Intel and coming in RHEL/CENTOS

34 NUMA effects on MCDRAM? (Chart; lower is better.) Sub-NUMA Clustering (SNC-4) divides the node into four NUMA regions, each assigned two MCDRAM memory controllers. Can first touch by MPI task 0 during the application's initialization phase cause uneven placement of addresses?

35 NUMA effects on MCDRAM? (Chart; lower is better.) Sub-NUMA Clustering (SNC-4) divides the node into four NUMA regions, each assigned two MCDRAM memory controllers. Can first touch by MPI task 0 during the application's initialization phase cause uneven placement of addresses?

36 WRF Kernels revisited

WSM5 Microphysics (single precision):
                original    optimized   original -> optimized
  KNC           11 GF/s     154 GF/s    14x
  KNL           68 GF/s     283 GF/s    4.2x
  KNC -> KNL    6.2x        1.8x

RRTMG Radiation (double precision):
                original    optimized   original -> optimized
  KNC           10 GF/s     36 GF/s     3.6x
  KNL           36 GF/s     78 GF/s     2.2x
  KNC -> KNL    3.6x        2.2x

Michalakes, Iacono and Jessup. Optimizing Weather Model Radiative Transfer Physics for Intel's Many Integrated Core (MIC) Architecture. Parallel Processing Letters, World Scientific, Vol. 26, No. 3 (Sept. 2016). Accepted; publication date TBD. 36

37 Tools overview Capabilities surveyed: profiling; op count and FP rate; memory traffic and bandwidth; thread utilization; vector utilization; roofline analysis; execution traces.
 - Intel VTune Amplifier: profiling by routine, line, and instruction; timeseries B/W output and traffic counts; thread utilization histogram; general exploration output; B/W, CPU time, spin time, and MPI comms as a function of time.
 - Intel Software Development Emulator: only way to count FP ops on current Xeon; counts loads/stores between registers and first-level cache; other capabilities not tried.
 - Intel Advisor: routine by routine and line by line analysis; roofline analysis (in beta); several capabilities not tried.
 - MPE/Jumpshot (ANL/LANS): by routine and instrumented region; detailed Gantt plots of execution traces.
 - Empirical Roofline Toolkit (NERSC/LBL): max FP rate; bandwidth at multiple levels of the memory hierarchy; roofline analysis.
 - Linux Perf, PAPI or other: counts loads/stores between LLC and memory; other capabilities possible.
 37

38 Summary New models such as MPAS and NUMA/NEPTUNE: higher order, variable resolution, demonstrated scalability. Focus is on computational efficiency to run faster. Progress with accelerators: early Intel Knights Landing results showing 1-2% of peak. Good news: with the new MCDRAM, we aren't memory bandwidth bound. Need to improve vector utilization and other bottlenecks to reach > 10% of peak. Naval Postgraduate School's results speeding up NUMA show promise. Ongoing efforts: measurement, characterization and optimization of floating point and data movement; explore the effect of different workloads, more aggressive compiler optimizations, and reduced (single) precision. 38

39 Acknowledgements Intel: Mike Greenfield, Ruchira Sasanka, Indraneil Gokhale, Larry Meadows, Zakhar Matveev, Dmitry Petunin, Dmitry Dontsov, Sumedh Naik, Karthik Raman. NRL and NPS: Alex Reinecke, Kevin Viner, Jim Doyle, Sasa Gabersek, Andreas Mueller (now at ECMWF), Frank Giraldo. NCAR: Bill Skamarock, Michael Duda, John Dennis, Chris Kerr. NOAA: Mark Govett, Jim Rosinski, Tom Henderson (now at Spire). Google search, which usually leads to something John McCalpin has already figured out... 39
