Towards Exascale Computing with the Atmospheric Model NUMA


Towards Exascale Computing with the Atmospheric Model NUMA
Andreas Müller, Daniel S. Abdi, Michal Kopera, Lucas Wilcox, Francis X. Giraldo (Department of Applied Mathematics, Naval Postgraduate School, Monterey, California, USA)
Tim Warburton (Virginia Tech, Blacksburg, Virginia, USA)

NUMA = Non-hydrostatic Unified Model of the Atmosphere: the dynamical core inside the Navy's next-generation weather prediction system NEPTUNE (Navy's Environment Prediction system Using the Numa Engine), developed by Prof. Francis X. Giraldo and generations of postdocs.

Goal (NOAA HIWPP project plan): by 2020, roughly 3 km to 3.5 km global resolution within operational requirements.
Achieved with NUMA: baroclinic wave test case at 3.0 km resolution within 4.15 minutes of runtime per one-day forecast on the supercomputer Mira (double precision, no shallow-atmosphere approximation, arbitrary terrain, IMEX in the vertical).
Expected: 2 km with further optimizations.
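The 4.15 minutes per forecast day quoted above maps directly onto the "model days per wallclock day" axis used on the strong-scaling plot later in the talk. A minimal sketch of that conversion (the 4.15-minute and 4.5-minute figures are from the slides, the rest is plain arithmetic):

```python
# Convert "minutes of runtime per one-day forecast" into
# "model days per wallclock day", the unit used on the scaling plots.
MINUTES_PER_DAY = 24 * 60

def model_days_per_wallclock_day(minutes_per_forecast_day: float) -> float:
    """How many simulated days fit into one day of wallclock time."""
    return MINUTES_PER_DAY / minutes_per_forecast_day

# Mira result quoted on the slide: 3.0 km run, 4.15 min per forecast day
print(model_days_per_wallclock_day(4.15))   # ~347 model days per wallclock day

# Operational-style target of ~4.5 minutes used later in the talk
print(model_days_per_wallclock_day(4.5))    # ~320 model days per wallclock day
```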

Communication between processors: a 4th-order finite difference method needs a stencil that reaches two grid points into the neighboring processor's domain, whereas compact-stencil methods (like the element-based methods in NUMA) only exchange the data on the element faces shared between CPU 1 and CPU 2. NUMA should therefore scale extremely well.
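To make the communication pattern concrete, here is a minimal one-dimensional halo-exchange sketch with mpi4py (an illustration, not NUMA's actual code; only the halo widths follow the slide's argument): a 4th-order centered finite difference needs a 2-point halo, while a compact element-based method exchanges only the single shared face value.

```python
# Minimal 1-D halo exchange sketch with mpi4py (illustration only, not NUMA code).
# A 4th-order centered finite difference needs halo_width = 2 ghost points per side;
# a compact element-based method (CG/DG as in NUMA) exchanges only the shared face
# values, i.e. halo_width = 1 in 1-D, independent of the polynomial order.
from mpi4py import MPI
import numpy as np

def exchange_halo(u_local, halo_width, comm=MPI.COMM_WORLD):
    """Exchange `halo_width` boundary values with the left/right neighbor ranks."""
    rank, size = comm.Get_rank(), comm.Get_size()
    left  = rank - 1 if rank > 0 else MPI.PROC_NULL
    right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

    recv_left  = np.empty(halo_width)
    recv_right = np.empty(halo_width)
    # Send our boundary values, receive the neighbor's values into the ghost layers.
    comm.Sendrecv(u_local[:halo_width],  dest=left,  recvbuf=recv_right, source=right)
    comm.Sendrecv(u_local[-halo_width:], dest=right, recvbuf=recv_left,  source=left)
    return recv_left, recv_right

# exchange_halo(u, halo_width=2)  -> 4th-order finite difference
# exchange_halo(u, halo_width=1)  -> compact-stencil element method
```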

Fastest supercomputers of the world according to top500.org (1 PFlops = 10^15 floating point operations per second):

  #  NAME        COUNTRY  TYPE            PEAK (PFLOPS)
  1  TaihuLight  China    Sunway          125.4
  2  Tianhe-2    China    Intel Xeon Phi   54.9
  3  Titan       USA      NVIDIA GPU       27.1
  4  Sequoia     USA      IBM BlueGene     20.1
  5  K computer  Japan    Fujitsu CPU      11.2
  6  Mira        USA      IBM BlueGene     10.0

Overview:
Mira (Blue Gene): strategy: optimize one setup for one specific computer.
Titan (GPUs): strategy: keep all options, portability on CPUs and GPUs.
Intel Knights Landing.

Mira (Blue Gene): optimize one setup for this specific computer.

Which convergence order to use?
[Figure: L2-error of the potential temperature θ in K for a 2D rising-bubble test case with viscosity µ = 0.1 m²/s, plotted against Δx (the average distance between grid points) for polynomial orders p = 3 to p = 10; the convergence order is p + 1. With L denoting the smallest scale of the flow, the error behaves differently in the regimes Δx >> L, Δx ≈ L and Δx << L. Different polynomial orders should be compared at the same Δx.]
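A minimal sketch of how an observed convergence order is typically estimated from two of the plotted data points (the rule "convergence order = p + 1" is from the slide; the helper function and its example numbers are purely illustrative):

```python
import math

def observed_order(err_coarse, err_fine, dx_coarse, dx_fine):
    """Estimate the convergence order from errors at two grid spacings:
    err ~ C * dx**q  =>  q = log(err_coarse/err_fine) / log(dx_coarse/dx_fine)."""
    return math.log(err_coarse / err_fine) / math.log(dx_coarse / dx_fine)

# Illustrative numbers only: halving dx should reduce the error by ~2**(p+1)
# once dx is well below the smallest scale L of the flow.
p = 3
err_coarse, err_fine = 1.0e-2, 1.0e-2 / 2 ** (p + 1)
print(observed_order(err_coarse, err_fine, dx_coarse=2.0, dx_fine=1.0))  # ~4.0
```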

Which convergence order to use? (continued)
[Figure/table: theoretical performance model for the current OpenMP version with vector intrinsics, listing the floating point operations per function (timeloop, ti_rk35, compute_pressure, create_rhs_volume, create_global_rhs, apply_boundary_conditions, filter) and comparing predicted with measured GFlops, together with time to solution and runtime per time step as a function of convergence order for DG, CG and CG/DG storage at fixed Δx.]

Measurements: roofline plot for the rising-bubble test case on Mira.
[Figure: attainable GFlops/s per node vs. arithmetic (operational) intensity in Flops/Byte, with the peak-bandwidth and peak-GFlops rooflines; blue: entire timeloop, red: main computational kernel (createrhs); data points mark different optimization stages on the way towards the goal.]
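For reference, the roofline model behind this and the later plots is simply the minimum of two bounds. A minimal sketch (the peak numbers in the example are the NVIDIA K20X values quoted later in the talk, used here only as an illustration):

```python
def roofline(intensity_flops_per_byte, peak_gflops, peak_bandwidth_gb_s):
    """Attainable performance (GFlops/s) for a kernel with the given arithmetic
    intensity: limited either by memory bandwidth or by peak floating point rate."""
    return min(peak_gflops, peak_bandwidth_gb_s * intensity_flops_per_byte)

# Example with the NVIDIA K20X numbers quoted later in the talk
# (1170 GFlops/s double precision, 208 GB/s memory bandwidth):
for ai in [0.5, 1, 2, 4, 8, 16]:
    print(ai, roofline(ai, peak_gflops=1170, peak_bandwidth_gb_s=208))
```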

Strong scaling with NUMA: 1.8 billion grid points (3.0 km horizontal resolution, 31 vertical levels), baroclinic instability, p = 3.
[Figure: model days per wallclock day vs. number of threads (up to about 4 million); 99.1% strong scaling efficiency at 3.14 million threads.]
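A minimal sketch of how a strong-scaling efficiency like the quoted 99.1% is computed (the throughput numbers in the example are placeholders, not values read off the plot):

```python
def strong_scaling_efficiency(throughput_ref, threads_ref, throughput, threads):
    """Strong scaling: fixed problem size, so ideal throughput grows linearly
    with the thread count. Efficiency = actual speedup / ideal speedup."""
    ideal = throughput_ref * (threads / threads_ref)
    return throughput / ideal

# Placeholder example: reference run on 0.5M threads, large run on 3.14M threads.
print(strong_scaling_efficiency(throughput_ref=70.0, threads_ref=0.5e6,
                                throughput=435.0, threads=3.14e6))   # ~0.99
```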

12.1% of theoretical peak (flops): 1.21 PFlops on the entire Mira system.
[Figure: percentage of peak flops vs. number of threads (10^4 to 10^7) for the baroclinic instability at p = 3 and 3.0 km horizontal resolution, broken down into total, filter, createrhs and IMEX.]

Where are we heading? Dynamics within 4.5 minutes of runtime per one-day forecast.
[Figure: number of threads (10^3 to 10^7) required vs. resolution in km (1 to 100) for the baroclinic instability with p = 3, 31 vertical layers and 128 columns per node; curves for the current version, a fully optimized Mira, and possibly Aurora in 2018.]
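The shape of these curves follows the usual cost scaling of a dynamical core that is explicit in the horizontal: halving the horizontal grid spacing quadruples the number of columns and roughly halves the usable time step, so the thread count needed to hold the runtime fixed grows roughly like (1/Δx)^3. A rough sketch of that estimate (the exponent and the anchor point are illustrative assumptions, not the model actually used to draw the curves):

```python
def threads_needed(resolution_km, ref_resolution_km, ref_threads, exponent=3):
    """Rough cost model: work per forecast day ~ (1/dx)**exponent
    (columns scale like 1/dx**2, the explicit horizontal time step roughly like dx)."""
    return ref_threads * (ref_resolution_km / resolution_km) ** exponent

# Anchor at the Mira result from this talk: ~3.14M threads at 3.0 km resolution.
for res in [10.0, 5.0, 3.0, 2.0]:
    print(f"{res:5.1f} km -> ~{threads_needed(res, 3.0, 3.14e6):.2e} threads")
```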

Titan (GPUs): keep all options, portability on CPUs and GPUs.

Titan: 18,688 CPU nodes (16 cores each), connected by a network, with 18,688 NVIDIA GPUs (2,688 CUDA cores each).

OCCA2: unified threading model (slide courtesy of Lucas Wilcox and Tim Warburton). Portability and extensibility: a device-independent kernel language (or OpenCL / CUDA) and native host APIs.
[Diagram: host-language APIs and the OCCA kernel language OKL (or kernels generated via Floopy) map onto back ends such as pthreads, Intel COI, CUDA-x86, OpenCL and CUDA, which target the devices: CPU processors, Xeon Phi, AMD GPUs, NVIDIA GPUs and FPGAs.]
Available at: https://github.com/tcew/occa2

Floopy/OCCA2 code (slide: courtesy of Lucas Wilcox and Tim Warburton) Mira Titan KNL

Weak scaling up to 16,384 GPUs (88% of Titan), using 900 elements per GPU and polynomial order 8 (up to 10.7 billion grid points).
[Figure: efficiency in % vs. number of GPUs (1 to 16,384) for CG without overlap, DG without overlap, and DG with overlap; 90% weak scaling efficiency using DG with communication overlap.]
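For completeness, the weak-scaling efficiency is computed differently from the strong-scaling one: the problem grows with the GPU count (900 elements per GPU here), so ideal scaling means constant runtime. A minimal sketch (the example runtimes are placeholders):

```python
def weak_scaling_efficiency(time_ref, time_n):
    """Weak scaling: problem size grows with the resource count, so ideally the
    runtime stays constant. Efficiency = reference runtime / observed runtime."""
    return time_ref / time_n

# Placeholder example: 1 GPU vs. 16,384 GPUs, each holding 900 elements.
print(weak_scaling_efficiency(time_ref=1.00, time_n=1.11))   # ~0.90
```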

Roofline plot: single precision on the NVIDIA K20X GPU on Titan.
[Figure: GFLOPS/s vs. arithmetic intensity in GFLOPS/GB for the horizontal volume, vertical volume, update and project kernels at convergence orders 2 to 12, against the roofline given by 3520 GFLOPS/s peak performance and 208 GB/s memory bandwidth.]

Roofline plot: double precision on the NVIDIA K20X GPU on Titan.
[Figure: same kernels and convergence orders as above, against the roofline given by 1170 GFLOPS/s peak performance and 208 GB/s memory bandwidth.]
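A quick way to read these two rooflines is through their ridge points, the arithmetic intensity above which a kernel is no longer bandwidth-bound. A small sketch using only the peak numbers quoted in the two plots:

```python
def ridge_point(peak_gflops, peak_bandwidth_gb_s):
    """Arithmetic intensity (Flops/Byte) above which a kernel can reach peak flops."""
    return peak_gflops / peak_bandwidth_gb_s

# NVIDIA K20X numbers from the two roofline plots above:
print(ridge_point(3520, 208))  # single precision: ~16.9 Flops/Byte
print(ridge_point(1170, 208))  # double precision: ~5.6 Flops/Byte
```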

Some more results from Titan:
GPU: up to 15 times faster than the original version on one CPU node for single precision (CAM-SE and other models: less than 5x speed-up).
CPU: single precision about 30% faster than double.
GPU: single precision about 2x faster than double.

                    single CPU node                 GPU
                    (16-core AMD Opteron 6274)      (NVIDIA K20X)
  peak performance  141 GFlops/s                    1.31 TFlops/s
  peak bandwidth    32 GB/s                         208 GB/s
  peak power        115 W                           235 W

Intel Knights Landing

Knights Landing: why is this so exciting? 64 cores per chip (Mira: 64 hardware threads per node). Each Knights Landing chip has 16 GB of MCDRAM with 440 GB/s memory bandwidth, more than 15 times faster than Mira. Aurora (the successor of Mira, expected in 2018) will have more than 50k Knights Landing nodes, more nodes than Mira. Overall, Aurora should allow us to run the same simulations as on Mira but 15 times faster.
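If the code is essentially bandwidth-bound (which the roofline plots above suggest for the lower convergence orders), the "15 times faster" estimate follows directly from the bandwidth and node-count ratios. A rough sketch of that reasoning (the Mira per-node bandwidth of about 29 GB/s and the approximate node counts are assumptions used only to reproduce the factor quoted on the slide):

```python
# Rough, bandwidth-bound estimate of Aurora vs. Mira for the same simulation.
# Assumptions (not from the slide): Mira BG/Q node bandwidth ~29 GB/s, ~49k nodes;
# the slide gives 440 GB/s per KNL chip and more than 50k Aurora nodes.
knl_bw_gb_s, mira_bw_gb_s = 440.0, 29.0
aurora_nodes, mira_nodes = 50_000, 49_000

per_node_speedup = knl_bw_gb_s / mira_bw_gb_s            # ~15x per node
system_speedup = per_node_speedup * aurora_nodes / mira_nodes
print(f"per node: ~{per_node_speedup:.0f}x, whole system: ~{system_speedup:.0f}x")
```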

Knights Landing: roofline plot for the OCCA version of NUMA.
[Figure: GFLOPS/s vs. arithmetic intensity in GFLOPS/GB for the volume kernel, diffusion kernel, update kernel of ARK, create_rhs_gmres_no_schur, create_lhs_gmres_no_schur_set2c and extract_q_gmres_no_schur, against the roofline given by 3000 GFLOPS/s peak performance and 440 GB/s memory bandwidth.]

Knights Landing: weak scaling (work in progress).
[Figure: scaling efficiency in % vs. number of KNL nodes, up to 16 nodes.]

Summary:
Mira: 3.0 km baroclinic instability within 4.15 minutes of runtime per one-day forecast; 99.1% strong scaling efficiency on Mira, 1.21 PFlops.
Titan: 90% weak scaling efficiency on 16,384 GPUs; the GPU version runs up to 15x faster than on one CPU node.
Intel Knights Landing: looks good, but there is still room for improvement.

Thank you for your attention!