Running the FIM and NIM Weather Models on GPUs

Mark Govett, Tom Henderson, Jacques Middlecoff, Jim Rosinski, Paul Madden
NOAA Earth System Research Laboratory

Global Models
- 0 to 14 days
- 10 to 30 km resolution
- 1000 CPU cores
- Hurricane Sandy: NOAA FIM (and other) models accurately predicted the hurricane track 5 days ahead.

Regional Models
- 0 to 48 hours
- 1 to 3 km resolution
- 1000 CPU cores
- Hurricane Sandy: the HRRR model consistently predicted gusts above 70 knots from the southeast over the New York area up to 15 hours in advance.

Global Cloud-Resolving Models
- 2-4 km resolution
- Minimum of 80,000 CPUs to run at ~50 percent of real time; need to get to 1-2 percent of real time for NWS use
- Key ingredient is massive amounts of computing: an estimated 200,000 to 300,000 CPU cores to be useful
- Model developments: NIM (NOAA ESRL, USA), MPAS (NCAR, USA), NICAM (JAMSTEC, Japan), ICON (DWD, Germany), cubed sphere (NASA, USA); other efforts under early development worldwide
[Figure: icosahedral grid]

F2C-ACC Fortran GPU Compiler
- Developed in 2009, before commercial GPU compilers were available
- Used to parallelize NIM and FIM dynamics
- Focus on minimizing changes to preserve the original code
- Single source to preserve performance portability: run on CPU, GPU, serial, parallel
- Advanced capabilities to preserve single source and improve performance
- Used to evaluate commercial compilers:
  - 2011: evaluated CAPS, PGI (Henderson)
  - 2012: shared stand-alone tests with all the vendors (performance, correctness, language support)
  - Plan another evaluation in 2013

Why is single source so important?
- Models are increasingly complex.
- Modeling systems take significant cost and effort to develop and maintain over their lifecycle; they must be performance portable and interoperable, and support both research and operational use.
- Computing systems are becoming more diverse: Intel, AMD, IBM Power, plus GPU, Intel MIC, and more.
[Diagram: modeling systems (HYCOM, WRF, GFS, FIM, NIM; physics, dynamics, chemistry; ensembles) mapped to supercomputers: NCEP operations, NOAA systems at Oak Ridge, West Virginia, and Boulder, DoE Titan 20 PFlops (2013), and an NCAR MIC cluster at TACC (2013).]

OpenMP / F2C-ACC GPU Kernel
- Placement of directives is generally the same for the GPU (F2C-ACC) and OpenMP directives.
- OpenMP: need sufficient work to overcome startup overhead; data movement is implicit.
- The F2C-ACC accelerator model assumes data resides on the CPU.

  !$OMP PARALLEL DO PRIVATE (k, ...) SCHEDULE (runtime)
  !ACC$REGION(<nvl>,<ime-ims+1>) BEGIN
  !ACC$DO PARALLEL(1)
  do ipn=ips,ipe
  !ACC$DO VECTOR(1)
    do k=1,nvl
      worka(k,ipn) = tr3d(k,ipn,1)
    end do
  end do
  !ACC$REGION END
  !$OMP END PARALLEL DO

OpenACC GPU Kernel
- Unclear how OpenACC compilers handle data movement.
- Explicitly listing each variable for each region can be tedious, particularly if the region is big and complicated.
- Unclear if OpenACC compilers support Fortran 90 array syntax.
- Compiler analysis could determine parallelism implicitly.

  !$acc DATA [copyin] [copyout]
  !$acc PARALLEL [gang] [worker] [vector]
  ! worka(:,:) = tr3d(:,:,1)    ! unclear if OpenACC can handle this
  !$acc LOOP [gang] [worker] [vector]
  do ipn=ips,ipe
  !$acc LOOP [gang] [worker] [vector]
    do k=1,nvl
      worka(k,ipn) = tr3d(k,ipn,1)
    end do
  end do
  !$acc END PARALLEL
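For comparison, a minimal sketch of the same kernel with the OpenACC data-movement and loop clauses written out explicitly (the routine name copy_tracer, its argument list, and the tracer dimension ntra are illustrative assumptions, not FIM/NIM code):

  ! Hypothetical example: explicit OpenACC data clauses for the kernel above.
  ! tr3d is copied to the GPU on entry; worka is copied back to the CPU on exit.
  subroutine copy_tracer(nvl, ips, ipe, ntra, tr3d, worka)
    implicit none
    integer, intent(in)  :: nvl, ips, ipe, ntra
    real,    intent(in)  :: tr3d(nvl, ips:ipe, ntra)
    real,    intent(out) :: worka(nvl, ips:ipe)
    integer :: ipn, k
  !$acc data copyin(tr3d) copyout(worka)
  !$acc parallel
  !$acc loop gang
    do ipn = ips, ipe
  !$acc loop vector
      do k = 1, nvl
        worka(k, ipn) = tr3d(k, ipn, 1)
      end do
    end do
  !$acc end parallel
  !$acc end data
  end subroutine copy_tracer

Spelling out a data clause like this for every array in every region is the bookkeeping burden the slide refers to.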

FIM Parallelization for Fine Grain
- Well-established code, designed in 2000 for CPUs; near operational status, running daily at 10, 15, and 30 km resolutions.
- Multi-faceted development: ensembles, chemistry, ocean.
- Code structure: Fortran modules, deeper call tree than NIM.
- Limited ability to change the code: must demonstrate a performance benefit with no degradation in code clarity; otherwise scientists must evaluate the changes.
- Indexing: lat-lon models use a(k,i,j); FIM uses a single horizontal index, a(k,indx).
- Parallelism (dynamics):
  - GPU: blocking in the horizontal, threading in the vertical.
  - OpenMP, MIC: threading in the horizontal, vectorization in the vertical.

Validation of Results
- Before CUDA 4.3, digits of accuracy were used to compare FIM / NIM results between the CPU and the GPU / MIC:

  Variable   Ndifs   RMS                RMSE        max         Digits
  rublten    2228    0.1320490309E-03   0.2634E-09  0.3922E-05  5
  rvblten    2204    0.2001348128E-03   0.6318E-09  0.2077E-04  4
  exch_h     3316    0.1670498588E+02   0.8979E-05  0.8379E-05  5
  hpbl          9    0.4522379124E+03   0.2688E-03  0.1532E-04  4

- Small differences at one timestep can become significant when the model runs over many timesteps.
- Now model results are identical (Intel, MIC, NVIDIA) when the fused multiply-add instruction is turned off, which eliminates the rounding differences FMA introduces in combined arithmetic operations.
- Identical results significantly speed up parallelization work and help detect synchronization bugs.
- Exceptions: power and log functions, and likely other intrinsics; the system versions are replaced by a library routine for the correctness tests.
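A minimal sketch of how a digits-of-accuracy comparison of this kind can be computed (illustrative only: the routine name compare_fields and the exact statistics are assumptions, not the actual FIM/NIM test harness):

  ! Compare a CPU field against a GPU field and report the number of
  ! differing points, the RMS of the reference field, the RMS error,
  ! the maximum relative error, and the digits of agreement at the worst point.
  subroutine compare_fields(n, cpu, gpu, ndifs, rms, rmse, relmax, digits)
    implicit none
    integer, intent(in)  :: n
    real(8), intent(in)  :: cpu(n), gpu(n)
    integer, intent(out) :: ndifs, digits
    real(8), intent(out) :: rms, rmse, relmax
    real(8) :: rel
    integer :: i

    ndifs  = 0
    rms    = 0.0d0
    rmse   = 0.0d0
    relmax = 0.0d0
    do i = 1, n
      rms  = rms  + cpu(i)**2
      rmse = rmse + (cpu(i) - gpu(i))**2
      if (cpu(i) /= gpu(i)) then
        ndifs  = ndifs + 1
        rel    = abs(cpu(i) - gpu(i)) / max(abs(cpu(i)), tiny(1.0d0))
        relmax = max(relmax, rel)
      end if
    end do
    rms  = sqrt(rms  / n)
    rmse = sqrt(rmse / n)
    ! Digits of agreement, e.g. a worst relative difference of 1.0e-5 -> 5 digits.
    if (relmax > 0.0d0) then
      digits = int(-log10(relmax))
    else
      digits = 16    ! fields are identical in double precision
    end if
  end subroutine compare_fields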

FIM Performance, Single Socket
- No changes to the FIM source code; GPU timings used the F2C-ACC compiler.
- Optimized for the Fermi GPU; further optimization for Kepler is needed.
- Code changes are needed to improve hybgen performance (currently >50% of GPU runtime).

  FIM dynamics   NVIDIA Fermi GPU   Intel SandyBridge CPU   Intel Xeon Phi KNC   NVIDIA Kepler GPU
  routine        (1 socket)         (1 socket)              (1 socket)           (early results)
  trcadv         1.28               2.10                    1.09 (1.9)           0.99 (2.0)
  cnuity         1.04               1.13                    0.49 (2.3)           0.68 (1.8)
  momtum         0.41               0.71                    0.34 (2.1)           0.35 (2.1)
  hybgen         4.13               4.09                    3.10 (1.3)           3.43 (1.2)
  TOTAL          8.01               8.87                    5.86 (1.5)           6.30 (1.4)

NIM Parallelization for Fine Grain
- Uniform, hexagonal-based icosahedral grid; a novel indirect addressing scheme permits concise, efficient code (see the sketch below).
- Designed for fine-grain parallelism in 2008.
- Dynamics: running on GPU and CPU, serial, parallel MPI, OpenMP; MIC soon.
- Physics: GFS, YSU; OpenMP, MIC, and GPU parallelization planned.
- Testing at 120, 60, and 30 km in aqua-planet runs out to 300 days; testing at 120 and 60 km with real-data runs.
- Indexing: lat-lon models use a(k,i,j); NIM uses a(k,ipn), with k the vertical index and ipn a single horizontal index.
- Parallelism (dynamics):
  - GPU: blocking in the horizontal, threading in the vertical.
  - OpenMP, MIC: threading in the horizontal, vectorization in the vertical.
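A minimal sketch of the indirect-addressing idea on the icosahedral grid (illustrative only: the names neighbor_avg, nprox, and prox are hypothetical stand-ins for NIM's actual data structures):

  ! Each column ipn has nprox(ipn) neighbors (5 or 6 on an icosahedral grid);
  ! prox(isn,ipn) gives the horizontal index of neighbor isn.
  subroutine neighbor_avg(nvl, nip, npmax, nprox, prox, f, favg)
    implicit none
    integer, intent(in)  :: nvl, nip, npmax
    integer, intent(in)  :: nprox(nip), prox(npmax, nip)
    real,    intent(in)  :: f(nvl, nip)
    real,    intent(out) :: favg(nvl, nip)
    integer :: ipn, isn, k

    do ipn = 1, nip                  ! horizontal: GPU blocks, or OpenMP/MIC threads
      do k = 1, nvl                  ! vertical: GPU threads, or CPU/MIC vectorization
        favg(k, ipn) = 0.0
      end do
      do isn = 1, nprox(ipn)         ! loop over 5 or 6 neighbors via indirect addressing
        do k = 1, nvl
          favg(k, ipn) = favg(k, ipn) + f(k, prox(isn, ipn))
        end do
      end do
      do k = 1, nvl
        favg(k, ipn) = favg(k, ipn) / real(nprox(ipn))
      end do
    end do
  end subroutine neighbor_avg

Because the horizontal dimension is a single index, the same (k, ipn) loop structure maps directly onto GPU blocking over ipn with threading over k, or onto OpenMP threading over ipn with vectorization over k.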

NIM Serial Performance (2013)
- No changes to the source code.
- Single-socket performance: 10K horizontal points, 96 vertical levels.
- Very efficient CPU performance: measured 29% of peak (Intel Westmere).

  Platform      Opteron   Westmere   SandyBridge   Fermi   K20x
  NIM runtime   143.0     86.8       60.0          25.0    20.7

- Parallel performance: being run on up to 160 GPUs; working on optimizing inter-GPU communications.

GPU-to-GPU Communications
- Scalable Modeling System (SMS): distributed-memory parallelization using a directive-based Fortran compiler and an MPI-based runtime library; used at NOAA for over two decades.
- Extended to support inter-GPU communications with a single directive:

  !SMS$EXCHANGE(a,b)

- A GPU-to-GPU exchange proceeds in five steps (a sketch follows below):
  1. Pack on the GPU
  2. Copy to the CPU
  3. MPI-based communication between CPUs
  4. Copy to the GPU
  5. Unpack on the GPU
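A minimal sketch of the five-step sequence using plain MPI and OpenACC rather than SMS-generated code (the routine halo_exchange, the single-neighbor layout, and the buffer handling are simplified, hypothetical illustrations of the pattern; the array a is assumed to already be resident on the GPU):

  ! Exchange one halo column of a(nvl, ims:ime) with a single neighbor rank.
  subroutine halo_exchange(nvl, ims, ime, ipe, nbr, a, comm)
    use mpi
    implicit none
    integer, intent(in)    :: nvl, ims, ime, ipe, nbr, comm
    real,    intent(inout) :: a(nvl, ims:ime)
    real    :: sendbuf(nvl), recvbuf(nvl)
    integer :: k, ierr, stat(MPI_STATUS_SIZE)

  !$acc data present(a) create(sendbuf, recvbuf)

    ! Step 1: pack the boundary column on the GPU
  !$acc parallel loop
    do k = 1, nvl
      sendbuf(k) = a(k, ipe)
    end do

    ! Step 2: copy the packed buffer to the CPU
  !$acc update host(sendbuf)

    ! Step 3: MPI-based communication between CPUs
    call MPI_Sendrecv(sendbuf, nvl, MPI_REAL, nbr, 0, &
                      recvbuf, nvl, MPI_REAL, nbr, 0, &
                      comm, stat, ierr)

    ! Step 4: copy the received buffer to the GPU
  !$acc update device(recvbuf)

    ! Step 5: unpack on the GPU into the halo region
  !$acc parallel loop
    do k = 1, nvl
      a(k, ime) = recvbuf(k)
    end do

  !$acc end data
  end subroutine halo_exchange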

NIM Parallel Performance: Weak Scaling
- 4096 columns per GPU, 96 vertical levels, 7000 timesteps (~10 days), Kepler K20x.
- Initial results: over 50% of the runtime was spent doing GPU-to-GPU communications.

  Kepler GPUs   GPU-to-GPU communications (s)   Total time (s)
  10            1059 (57%)                      1859
  40            1071 (57%)                      1878
  160           1082 (57%)                      1890

- Communication breakdown on 160 GPUs (seconds):

  Initialization         3
  Pack data on GPU       861
  CPU-GPU copy           59
  MPI communications     82
  Unpack on GPU          77
  Total                  1082

NIM Parallel Performance: Weak Scaling with Communications Optimization
- Moved the collective operation to the CPU instead of doing it on the GPU with GPU mapped memory: there were too many small writes across the PCIe bus.
- Resulted in a 5-17x speedup for the unpack operation.

  Kepler GPUs   GPU-to-GPU communications (s)   Total time (s)
  10            232 (22%)                       1034
  40            247 (23%)                       1054
  160           266 (24%)                       1076

- Communication breakdown on 160 GPUs (seconds):

  Initialization         3  ( 1%)
  Pack data on GPU       45 (17%)
  CPU-GPU copy           59 (22%)
  MPI communications     82 (31%)
  Unpack on GPU          77 (29%)
  Total                  266

3.5 km NIM on Titan in 2013
- Dynamics on the GPU, physics on the CPU with OpenMP.
- Each timestep: dynamics runs on the GPU, the state is copied GPU to CPU, physics runs on the CPU, and the result is copied CPU to GPU for the next dynamics step (a sketch of this loop follows below).
- 10-day forecast, ~10,000 horizontal points per GPU.

  Resolution (km)   Vertical levels   GPUs   Dynamics   CPU-GPU transfer   Physics   Total time (hours)
  30                96                40     1700       400                1900      1.1
  15                96                160    1700       400                1900      2.2
  7.5               96                640    1700       400                1900      4.4 (7.2%)
  3.75              96                2560   1700       400                1900      8.8 (3.6%)
  3.75              192               5120   1700       400                1900      8.8 (3.6%)
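A minimal sketch of the split dynamics/physics timestep loop described above, using OpenACC update directives for the transfers (run_forecast, dynamics_gpu, and physics_cpu are hypothetical placeholders; NIM's actual driver is not shown in the slides):

  ! Hybrid driver: dynamics on the GPU, physics on the CPU, with explicit
  ! state transfers each timestep.
  subroutine run_forecast(nsteps, nvl, nip, state)
    implicit none
    integer, intent(in)    :: nsteps, nvl, nip
    real,    intent(inout) :: state(nvl, nip)
    integer :: it

  !$acc data copy(state)
    do it = 1, nsteps
      call dynamics_gpu(state)    ! dynamical core on the GPU
  !$acc update host(state)        ! GPU -> CPU copy of the model state
      call physics_cpu(state)     ! GFS/YSU physics on the CPU (OpenMP inside)
  !$acc update device(state)      ! CPU -> GPU copy for the next dynamics step
    end do
  !$acc end data

  contains

    subroutine dynamics_gpu(s)    ! empty placeholder for the real GPU dynamics
      real, intent(inout) :: s(:,:)
    end subroutine dynamics_gpu

    subroutine physics_cpu(s)     ! empty placeholder for the real CPU physics
      real, intent(inout) :: s(:,:)
    end subroutine physics_cpu

  end subroutine run_forecast

The two update directives correspond to the GPU-to-CPU and CPU-to-GPU transfers accounted for in the CPU-GPU transfer column of the table above.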