Towards Exascale Computing with the Atmospheric Model NUMA

Size: px

Start display at page:

Download "Towards Exascale Computing with the Atmospheric Model NUMA"

Avis Pierce
5 years ago
Views:

1 Towards Exascale Computing with the Atmospheric Model NUMA Andreas Müller, Daniel S. Abdi, Michal Kopera, Lucas Wilcox, Francis X. Giraldo Department of Applied Mathematics Naval Postgraduate School, Monterey (California), USA Tim Warburton Virginia Tech Blacksburg (Virginia), USA

2 NUMA = Non-hydrostatic Unified Model of the Atmosphere dynamical core inside the Navy s next generation weather prediction system NEPTUNE (Navy s Environment Prediction system Using the Numa Engine) developed by Prof. Francis X. Giraldo and generations of postdocs

3 Goal NOAA: HIWPP project plan: Goal for 2020: ~ 3km 3.5km global resolution within operational requirements

4 Goal NOAA: HIWPP project plan: Goal for 2020: ~ 3km 3.5km global resolution within operational requirements Achieved with NUMA: baroclinic wave test case at 3.0km within 4.15 minutes per one day forecast on supercomputer Mira double precision, no shallow atmosphere approx., arbitrary terrain, IMEX in the vertical

5 Goal NOAA: HIWPP project plan: Goal for 2020: ~ 3km 3.5km global resolution within operational requirements Achieved with NUMA: baroclinic wave test case at 3.0km within 4.15 minutes per one day forecast on supercomputer Mira double precision, no shallow atmosphere approx., arbitrary terrain, IMEX in the vertical Expect: 2km by doing more optimizations

6 Communication between processors 4th order Finite Difference

7 Communication between processors 4th order Finite Difference CPU 1 CPU 2

8 Communication between processors 4th order Finite Difference compact stencil methods (like in NUMA) CPU 1 CPU 2

9 Communication between processors 4th order Finite Difference compact stencil methods (like in NUMA) CPU 1 CPU 2 CPU 1 CPU 2

10 Communication between processors 4th order Finite Difference compact stencil methods (like in NUMA) NUMA should scale extremely well CPU 1 CPU 2 CPU 1 CPU 2

11 Fastest Supercomputers of the World according to top500.org # NAME COUNTRY TYPE PEAK (PFLOPS) 1 TaihuLight China Sunway Tianhe-2 China Intel Xeon Phi Titan USA NVIDIA GPU Sequoia USA IBM BlueGene K computer Japan Fujitsu CPU Mira USA IBM BlueGene PFlops = floating point ops. per sec.

12 Fastest Supercomputers of the World according to top500.org # NAME COUNTRY TYPE PEAK (PFLOPS) 1 TaihuLight China Sunway Tianhe-2 China Intel Xeon Phi Titan USA NVIDIA GPU Sequoia USA IBM BlueGene K computer Japan Fujitsu CPU Mira USA IBM BlueGene PFlops = floating point ops. per sec.

13 overview Mira: Blue Gene strategy: optimize one setup for one specific computer Titan: GPUs strategy: keep all options, portability on CPUs and GPUs Intel Knights Landing

14 Mira Titan KNL Mira: Blue Gene optimize one setup for this specific computer

15 Mira Titan KNL Which convergence order to use? L2-error of pot. temperature for 2D rising bubble with viscosity µ=0.1m 2 /s 10 0 x: average distance between grid points 10-1 L 2 -error of 3 in K convergence order = p + 1 p=3 p=4 p=5 p=6 p=7 p=8 p=9 p= "x / m

16 Mira Titan KNL Which convergence order to use? L2-error of pot. temperature for 2D rising bubble with viscosity µ=0.1m 2 /s x: average distance between grid points L: smallest scale of the flow L 2 -error of 3 in K x << L x L x >> L convergence order = p + 1 p=3 p=4 p=5 p=6 p=7 p=8 p=9 p= "x / m

17 Mira Titan KNL Which convergence order to use? L2-error of pot. temperature for 2D rising bubble with viscosity µ=0.1m 2 /s x: average distance between grid points L: smallest scale of the flow L 2 -error of 3 in K x >> L x << L x L different polynomial orders should be convergence order = p + 1 compared at same x p=3 p=4 p=5 p=6 p=7 p=8 p=9 p= "x / m

18 pmatrix apply_boundary_conditions E-05 9*nboun b filter E-06 nvar*npoin Which convergence order to use? b filter E-06 nvar*npoin createrhs theoretical performance DRAM in GBmodel for fixed x timeloop a) 350 b) time to solution in seconds time to solution in seconds DG CG CG/DG 350 a) b) DG 300 CG convergence order 2a) b) convergence order floating point operations: current CG storage OpenMP version with vector intrinsics function +, -, *, / ** (power) times called per step flops per call total GFlops flops per DG point and call total GFlops measured t timeloop E ti_rk35 19*npoin E compute_pressure 7*npoin 1*npoin E create_rhs_volume ((nvar+1)*3*(6+2*(nop+1))+10+3*16+(nvar-4)*5+nvar +nvar)*npoindg E metrics (17+4*9+3+3*2*(nop+1))*npoinDG kvector 0 for case create_global_rhs nvar*(nprocboun+npoin) E apply_boundary_conditions nboun*3* E filter (3*2*(nop+1)+2)*(nop+1)^3*nvar*nelem E total ime in seconds CG/DG CG storage CG/DG DG runtime per timestep in seconds r timestep in seconds convergence order 0.50 Mira Titan KNL 0.75 runtime per timestep CG storage CG/DG DG DG CG CG/DG ime in seconds runtime per timestep in seconds DG CG storage, m CG storage, a

19 Measurements: roofline plot rising bubble test case on Mira 10 3 attainable timeloop createrhs GFlops/s per node GFlops per node peak bandwidth 9.8 peak GFlops goal blue: entire timeloop red: main computational kernel data points: different optimization stages arithmetic operational intensity (Flops/Bytes) (Flops/Byte) Mira Titan KNL

20 Mira Titan KNL Strong scaling with NUMA 1.8 billion grid points (3.0km horizontal, 31 levels vertical) 450 baroclinic instability, p=3, 3.0km horizontal resolution model days per wallclock day % strong scaling efficiency M threads number of threads #10 6 #10 6 3

21 Mira Titan KNL 12.1% of theoretical peak (flops) baroclinic instability, p=3, 3.0km horizontal resolution total filter createrhs IMEX 1.21 PFlops on % of peak flops 8 6 entire Mira number of threads

22 Mira Titan KNL Where are we heading? dynamics within 4.5 minutes runtime per one day forecast 10 7 baroclinic instability, p=3, 31 vertical layers, 128 columns per node 10 6 number of threads 10 5 current version resolution in km

23 Mira Titan KNL Where are we heading? dynamics within 4.5 minutes runtime per one day forecast 10 7 baroclinic instability, p=3, 31 vertical layers, 128 columns per node 10 6 current number of threads 10 5 Mira fully version 10 4 optimized resolution in km

24 Mira Titan KNL Where are we heading? dynamics within 4.5 minutes runtime per one day forecast 10 7 baroclinic instability, p=3, 31 vertical layers, 128 columns per node 10 6 number of threads Aurora 2018? Mira fully optimized current version resolution in km

25 Mira Titan KNL Titan: GPUs keep all options, portability on CPUs and GPUs

26 Mira Titan KNL Titan 18,688 CPU nodes (16 cores each) network 18,688 NVIDIA GPUs (2,688 CUDA cores each)

27 OCCA2: unified threading model (slide: courtesy of Lucas Wilcox and Tim Warburton) OCCA2: unified threading model OCCA2: unified threading model Portability & extensibility: device independent kernel language (or OpenCL / CUDA) and native host APIs. Portability & extensibility: device independent kernel language and native host APIs. Host Language # pthreads CPU Processors ++ Xeon Phi Intel COI AMD GPUs Kernel Language OCCA FPGAs OKL CUDA-x86 NVIDIA GPUs Floopy Available at: Back ends Devices 23 Mira Titan KNL

28 Floopy/OCCA2 code (slide: courtesy of Lucas Wilcox and Tim Warburton) Mira Titan KNL

29 Floopy/OCCA2 code (slide: courtesy of Lucas Wilcox and Tim Warburton) Mira Titan KNL

30 Mira Titan KNL Weak scaling up to 16,384 GPUs (88% of Titan) using 900 elements per GPU, polynomial order 8 (up to 10.7 billion grid points) Efficiency(%) % weak scaling efficiency using DG with communication overlap CG nooverlap DG nooverlap DG overlap No of GPUs

31 Mira Titan KNL roofline plot: single precision on the NVIDIA K20X GPU on Titan Horizontal volume kernel Vertical volume kernel Update kernel Project kernel Roofline 3520 GFLOPS/s GFLOPS/s GB/s 4 12 convergence order GFLOPS/GB

32 Mira Titan KNL roofline plot: double precision on the NVIDIA K20X GPU on Titan Horizontal volume kernel Vertical volume kernel Update kernel Project kernel Roofline 1170 GFLOPS/s GFLOPS/s GB/s convergence order GFLOPS/GB

33 Mira Titan KNL some more results from Titan GPU: up to 15 times faster than original version on one CPU node for single precision (CAM-SE and other models: less than 5x speed-up) CPU: single CPU node GPU precision about 30% faster than 16-core AMD Opteron 6274 NVIDIA K20X double peak performance 141 GFlops/s 1.31 TFlops/s GPU: single about 2x faster peak bandwidth 32 GB/s 208 GB/s than double peak energy 115 W 235 W

34 Intel Knights Landing Mira Titan KNL

35 Mira Titan KNL Knights Landing Why is this so exciting? 64 cores (Mira: 64 hardware threads) each Knights Landing chip has 16GB MCDRAM with 440GB/s memory bandwidth more than 15 times faster than Mira Aurora (successor of Mira, expected in 2018): more than 50k Knights Landing nodes more nodes than Mira overall: Aurora should allow us to run the same simulations like on Mira but 15 times faster

36 Mira Titan KNL Knights Landing: roofline plot OCCA version of NUMA GFLOPS/s GB/s GFLOPS/s Volume kernel Diffusion kernel Update Kernel of ARK create_rhs_gmres_no_schur create_lhs_gmres_no_schur_set2c extract_q_gmres_no_schur Roofline GFLOPS/GB

37 Mira Titan KNL Knights Landing: weak scaling work in progress Scaling efficiency (%) Number of KNL nodes

38 Mira Titan KNL Summary Mira: 3.0km baroclinic instability within 4.15 minutes runtime per one day forecast 99.1% strong scaling efficiency on Mira, 1.21 PFlops Titan: 90% weak scaling efficiency on 16,384 GPUs GPU: runs up to 15x faster than on one CPU node Intel Knights Landing: looks good but still room for improvement

39 Thank you for your attention!

NPS-NRL-Rice-UIUC Collaboration on Navy Atmosphere-Ocean Coupled Models on Many-Core Computer Architectures Annual Report

DISTRIBUTION STATEMENT A: Distribution approved for public release; distribution is unlimited. NPS-NRL-Rice-UIUC Collaboration on Navy Atmosphere-Ocean Coupled Models on Many-Core Computer Architectures