Towards Exascale Computing with the Atmospheric Model NUMA
Andreas Müller, Daniel S. Abdi, Michal Kopera, Lucas Wilcox, Francis X. Giraldo (Department of Applied Mathematics, Naval Postgraduate School, Monterey, California, USA)
Tim Warburton (Virginia Tech, Blacksburg, Virginia, USA)
NUMA = Non-hydrostatic Unified Model of the Atmosphere: the dynamical core inside the Navy's next-generation weather prediction system NEPTUNE (Navy's Environment Prediction system Using the Numa Engine), developed by Prof. Francis X. Giraldo and generations of postdocs.
Goal (NOAA HIWPP project plan): for 2020, ~3 to 3.5 km global resolution within operational requirements.
Achieved with NUMA: baroclinic wave test case at 3.0 km within 4.15 minutes of runtime per one-day forecast on the supercomputer Mira (double precision, no shallow-atmosphere approximation, arbitrary terrain, IMEX in the vertical).
Expected: 2 km by doing more optimizations.
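As a quick sanity check (a sketch, not from the slides), 4.15 minutes of runtime per one-day forecast can be converted into simulated days per wallclock day:

```python
# Sanity check: convert runtime per forecast day into
# simulated days per wallclock day.
minutes_per_forecast_day = 4.15        # NUMA at 3.0 km on Mira
minutes_per_wallclock_day = 24 * 60
simulated_days_per_day = minutes_per_wallclock_day / minutes_per_forecast_day
print(round(simulated_days_per_day))   # ~347 simulated days per wallclock day
```

This is comfortably inside typical operational requirements of a few hundred simulated days per day.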
Communication between processors: a 4th-order finite difference stencil reaches two grid points across the boundary between CPU 1 and CPU 2, whereas compact-stencil methods (like those in NUMA) only exchange data on the shared interface. NUMA should therefore scale extremely well.
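A minimal sketch of why the halo is smaller, assuming a 2-D N x N subdomain per processor and one shared edge (these sizes are illustrative, not taken from NUMA itself):

```python
# Illustrative halo-size comparison for one shared edge of an N x N subdomain.
def halo_points_fd4(n):
    # A 4th-order centered finite difference stencil needs neighbor
    # data 2 grid points deep along the shared edge.
    return 2 * n

def halo_points_compact(n):
    # Compact-stencil methods (like the element-based Galerkin methods
    # in NUMA) only need the values directly on the shared face.
    return n

n = 100
print(halo_points_fd4(n), halo_points_compact(n))  # 200 100
```

The compact method exchanges half as much data per edge here, and the gap widens for higher-order finite difference stencils.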
Fastest supercomputers of the world according to top500.org (1 PFlops = 10^15 floating point operations per second):

 #  Name        Country  Type            Peak (PFlops)
 1  TaihuLight  China    Sunway          125.4
 2  Tianhe-2    China    Intel Xeon Phi   54.9
 3  Titan       USA      NVIDIA GPU       27.1
 4  Sequoia     USA      IBM BlueGene     20.1
 5  K computer  Japan    Fujitsu CPU      11.2
 6  Mira        USA      IBM BlueGene     10.0
Overview:
- Mira (IBM Blue Gene): strategy: optimize one setup for one specific computer
- Titan (NVIDIA GPUs): strategy: keep all options, portability on CPUs and GPUs
- Intel Knights Landing
Mira: Blue Gene. Strategy: optimize one setup for this specific computer.
Which convergence order to use?
[Figure: L2-error of potential temperature θ (in K) vs average grid spacing Δx (in m) for a 2D rising bubble with viscosity µ = 0.1 m²/s, polynomial orders p = 3 to 10; convergence order = p + 1.]
Δx: average distance between grid points; L: smallest scale of the flow.
Three regimes: Δx >> L, Δx ≈ L, Δx << L.
Different polynomial orders should be compared at the same Δx.
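The convergence order in such a plot can be estimated from the errors on two grids; a minimal sketch (the error values below are made up for illustration, only the formula matters):

```python
import math

def observed_order(e_coarse, e_fine, dx_coarse, dx_fine):
    # Observed convergence order from two (error, grid-spacing) pairs:
    # order = log(e_coarse / e_fine) / log(dx_coarse / dx_fine)
    return math.log(e_coarse / e_fine) / math.log(dx_coarse / dx_fine)

# Hypothetical errors for p = 3 (expected order p + 1 = 4):
# halving dx should reduce the error by about 2**4 = 16.
order = observed_order(1.6e-2, 1.0e-3, 20.0, 10.0)
print(round(order, 2))  # 4.0
```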
Which convergence order to use?
[Figure: time to solution in seconds vs convergence order (2 to 10) from a theoretical performance model for fixed Δx, and measured runtime per timestep, each for CG, DG, and CG/DG storage variants.]
Floating point operations were counted per function for the current CG-storage OpenMP version with vector intrinsics (timeloop, ti_rk35, compute_pressure, create_rhs_volume, create_global_rhs, apply_boundary_conditions, filter): about 2599 GFlops counted in total vs about 3007 GFlops measured.
Measurements: roofline plot for the rising bubble test case on Mira.
[Figure: attainable GFlops/s per node vs arithmetic/operational intensity (Flops/Byte), with the roofline given by peak bandwidth and peak GFlops; the goal region lies near the roofline.]
Blue: entire timeloop; red: main computational kernel (createrhs); data points: different optimization stages.
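The roofline model behind this plot is simple to state: attainable performance is limited either by peak compute or by memory bandwidth times arithmetic intensity. A sketch with hypothetical node numbers (not Mira's actual specs):

```python
# Roofline model: attainable flop rate is the minimum of peak compute
# and (bandwidth * arithmetic intensity).
def attainable_gflops(intensity_flops_per_byte, peak_gflops, bandwidth_gb_s):
    return min(peak_gflops, bandwidth_gb_s * intensity_flops_per_byte)

peak, bw = 200.0, 30.0  # hypothetical node: GFlops/s and GB/s
print(attainable_gflops(1.0, peak, bw))   # 30.0  (bandwidth-bound)
print(attainable_gflops(10.0, peak, bw))  # 200.0 (compute-bound)
```

Moving data points up and to the right in the plot, as the optimization stages do, means raising both intensity and achieved throughput.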
Strong scaling with NUMA: 1.8 billion grid points (3.0 km horizontal resolution, 31 vertical levels), baroclinic instability, p = 3.
[Figure: model days per wallclock day vs number of threads, up to about 4 million threads.]
99.1% strong scaling efficiency at 3.14M threads.
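For reference, strong scaling efficiency compares the achieved speedup on a fixed problem size against the ideal speedup from the thread count. A sketch with hypothetical timings chosen to reproduce the slide's 99.1% figure:

```python
# Strong scaling efficiency = actual speedup / ideal speedup,
# for a fixed problem size. Timings below are hypothetical.
def strong_scaling_efficiency(t_base, n_base, t, n):
    speedup = t_base / t
    ideal = n / n_base
    return speedup / ideal

# e.g. going from 1.0M to 3.14M threads with near-perfect speedup:
eff = strong_scaling_efficiency(t_base=3.14, n_base=1.0e6, t=1.009, n=3.14e6)
print(f"{100 * eff:.1f}%")  # 99.1%
```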
12.1% of theoretical peak (flops): baroclinic instability, p = 3, 3.0 km horizontal resolution.
[Figure: % of peak flops vs number of threads (10^4 to 10^7) for total, filter, createrhs, and IMEX.]
1.21 PFlops on the entire Mira.
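The two headline numbers are consistent with each other, using Mira's 10.0 PFlops peak from the top500 table above:

```python
# Fraction of peak from the slide's numbers: 1.21 PFlops sustained
# on Mira, whose theoretical peak is 10.0 PFlops.
sustained_pflops = 1.21
peak_pflops = 10.0
percent_of_peak = 100 * sustained_pflops / peak_pflops
print(f"{percent_of_peak:.1f}%")  # 12.1%
```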
Where are we heading? Dynamics within 4.5 minutes of runtime per one-day forecast: baroclinic instability, p = 3, 31 vertical layers, 128 columns per node.
[Figure: number of threads required vs resolution in km (1 to 100), for the current version, a fully optimized version on Mira, and Aurora (2018?).]
Titan: GPUs. Strategy: keep all options, portability on CPUs and GPUs.
Titan: 18,688 CPU nodes (16 cores each) and 18,688 NVIDIA GPUs (2,688 CUDA cores each), connected by a network.
OCCA2: unified threading model (slide courtesy of Lucas Wilcox and Tim Warburton).
Portability & extensibility: device-independent kernel language (OKL, or OpenCL/CUDA) and native host APIs.
Back ends (pthreads, Intel COI, CUDA-x86, ...) map kernels written in OKL (or generated via Floopy) from a host language such as C++ onto the devices: CPU processors, Xeon Phi, AMD GPUs, NVIDIA GPUs, FPGAs.
Available at: https://github.com/tcew/occa2
Floopy/OCCA2 code (slide courtesy of Lucas Wilcox and Tim Warburton). [Code example shown on the slide.]
Weak scaling up to 16,384 GPUs (88% of Titan) using 900 elements per GPU, polynomial order 8 (up to 10.7 billion grid points).
[Figure: efficiency (%) vs number of GPUs, 1 to 16,384, for CG without overlap, DG without overlap, and DG with overlap.]
90% weak scaling efficiency using DG with communication overlap.
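Unlike strong scaling, weak scaling grows the problem with the machine (here 900 elements per GPU), so ideally the runtime stays constant. A sketch with hypothetical timings matching the 90% figure:

```python
# Weak scaling efficiency: problem size grows with the number of GPUs,
# so ideal runtime is constant and efficiency = t_single / t_many.
# Timings below are hypothetical.
def weak_scaling_efficiency(t_single, t_many):
    return t_single / t_many

eff = weak_scaling_efficiency(t_single=0.90, t_many=1.00)
print(f"{100 * eff:.0f}%")  # 90%
```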
Roofline plot: single precision on the NVIDIA K20X GPU on Titan.
[Figure: GFLOPS/s vs arithmetic intensity (GFLOPS/GB) for the horizontal volume, vertical volume, update, and project kernels at convergence orders 2 to 12; roofline from 3520 GFLOPS/s peak and 208 GB/s bandwidth.]
Roofline plot: double precision on the NVIDIA K20X GPU on Titan.
[Figure: GFLOPS/s vs arithmetic intensity (GFLOPS/GB) for the horizontal volume, vertical volume, update, and project kernels at convergence orders 2 to 12; roofline from 1170 GFLOPS/s peak and 208 GB/s bandwidth.]
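The ridge point of each roofline, i.e. the arithmetic intensity at which a kernel stops being bandwidth-bound, follows directly from the slides' peak numbers:

```python
# Ridge point of the K20X rooflines: peak compute / peak bandwidth.
peak_single, peak_double, bandwidth = 3520.0, 1170.0, 208.0  # GFLOPS/s, GB/s
ridge_single = peak_single / bandwidth
ridge_double = peak_double / bandwidth
print(f"{ridge_single:.1f} {ridge_double:.1f} flops/byte")  # ~16.9 and ~5.6
```

Kernels to the left of the ridge point are limited by memory bandwidth, which is why the double-precision plot has more compute-bound kernels than the single-precision one.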
Some more results from Titan:
- GPU: up to 15 times faster than the original version on one CPU node in single precision (CAM-SE and other models: less than 5x speed-up)
- CPU: single precision about 30% faster than double precision
- GPU: single precision about 2x faster than double precision

                   16-core AMD Opteron 6274   NVIDIA K20X
peak performance   141 GFlops/s               1.31 TFlops/s
peak bandwidth     32 GB/s                    208 GB/s
peak energy        115 W                      235 W
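Some context for the 15x figure, computed from the table's own peak numbers (a sketch; achieved speedups also depend on the code, not just hardware peaks):

```python
# Peak ratios from the slide's CPU/GPU table
# (16-core AMD Opteron 6274 vs NVIDIA K20X).
cpu = {"gflops": 141.0, "gb_s": 32.0, "watt": 115.0}
gpu = {"gflops": 1310.0, "gb_s": 208.0, "watt": 235.0}
flops_ratio = gpu["gflops"] / cpu["gflops"]  # ~9.3x peak compute
bw_ratio = gpu["gb_s"] / cpu["gb_s"]         # 6.5x peak bandwidth
print(f"{flops_ratio:.1f}x compute, {bw_ratio:.1f}x bandwidth")
```

A 15x single-precision speedup therefore exceeds both hardware peak ratios, which is only possible because the GPU code path is also better optimized than the original CPU version.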
Intel Knights Landing
Knights Landing: why is this so exciting?
- 64 cores (Mira: 64 hardware threads)
- each Knights Landing chip has 16 GB MCDRAM with 440 GB/s memory bandwidth: more than 15 times faster than Mira
- Aurora (successor of Mira, expected in 2018): more than 50k Knights Landing nodes, more nodes than Mira
Overall: Aurora should allow us to run the same simulations as on Mira, but 15 times faster.
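A small consistency check on the bandwidth claim (the per-node Mira bandwidth below is implied by the slide's own numbers, not stated independently):

```python
# The slide states 440 GB/s MCDRAM bandwidth, "more than 15 times
# faster than Mira"; this implies roughly 440 / 15 ~ 29 GB/s of
# usable memory bandwidth per Mira node.
knl_bandwidth_gb_s = 440.0
speedup_vs_mira = 15.0
implied_mira_gb_s = knl_bandwidth_gb_s / speedup_vs_mira
print(round(implied_mira_gb_s, 1))  # ~29.3
```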
Knights Landing: roofline plot for the OCCA version of NUMA.
[Figure: GFLOPS/s vs arithmetic intensity (GFLOPS/GB) for the volume kernel, diffusion kernel, update kernel of ARK, create_rhs_gmres_no_schur, create_lhs_gmres_no_schur_set2c, and extract_q_gmres_no_schur; roofline from 3000 GFLOPS/s peak and 440 GB/s bandwidth.]
Knights Landing: weak scaling (work in progress).
[Figure: scaling efficiency (%) vs number of KNL nodes, 2 to 16.]
Summary:
- Mira: 3.0 km baroclinic instability within 4.15 minutes of runtime per one-day forecast; 99.1% strong scaling efficiency, 1.21 PFlops
- Titan: 90% weak scaling efficiency on 16,384 GPUs
- GPU: runs up to 15x faster than on one CPU node
- Intel Knights Landing: looks good but still room for improvement
Thank you for your attention!