Adapting Numerical Weather Prediction codes to heterogeneous architectures: porting the COSMO model to GPUs

Adapting Numerical Weather Prediction codes to heterogeneous architectures: porting the COSMO model to GPUs
O. Fuhrer, T. Gysi, X. Lapillonne, C. Osuna, T. Dimanti, T. Schulthess and the HP2C team
Federal Department of Home Affairs FDHA, Federal Office of Meteorology and Climatology MeteoSwiss, C2SM

The COSMO model
- Limited-area model
- Run operationally by several national weather services within the Consortium for Small-Scale Modelling: Germany, Switzerland, Italy, Poland, Greece, Romania, Russia
- Used for climate research at several academic institutions

Why use Graphics Processing Units?
- Higher peak performance at lower cost / power consumption
- High memory bandwidth

                            CPU: AMD Opteron (Interlagos)   GPU: Fermi X2090
  Cores                                  16                       512
  Freq. (GHz)                            2.1                      1.3
  Peak Perf. S.P. (GFLOPs)               268                     1330
  Peak Perf. D.P. (GFLOPs)               134                      665
  Memory Bandwidth (GB/sec)              57                       155
  Power Cons. (W)                        115                      225

  (GPU vs. CPU: roughly 5x the peak performance and 3x the memory bandwidth)

Using GPUs: the accelerator approach
- CPU and GPU have different memories.
- [Diagram: CPU-only system vs. hybrid CPU + GPU system]
- The most compute-intensive parts are ported to the GPU; data is copied back and forth between GPU and CPU around each accelerated part.
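As a minimal illustration of this accelerator approach (not code from the project; array names and sizes are invented), the sketch below offloads one loop with OpenACC and copies the data to and from the GPU around it, which is exactly the host-device traffic the next slide quantifies:

  program accel_sketch
    implicit none
    integer, parameter :: n = 1000000
    real :: a(n), b(n), c(n)
    integer :: i

    b = 1.0
    c = 2.0

    ! Copy b and c to the GPU, run the loop there, copy a back:
    ! this transfer is repeated around every accelerated part.
    !$acc parallel loop copyin(b, c) copyout(a)
    do i = 1, n
       a(i) = b(i) + c(i)
    end do
    !$acc end parallel loop

    print *, 'a(1) =', a(1)
  end program accel_sketch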

What does this mean for an NWP application?
- Low FLOP count per load/store (stencil computations)
- Example with COSMO-2 (operational configuration at MeteoSwiss)*:

    Part          Time / Δt
    Dynamics      172 ms
    Physics        36 ms
    Total         253 ms

  vs. transfer of ten prognostic variables: 118 ms

- The CPU-GPU data transfer time is large with respect to the computation time: the accelerator approach might not be optimal.

* CPU measurements: Cray XT6, Magny-Cours, 45 nodes, COSMO-2. GPU measurements: PCIe 8x, HP SL390, 2 GB/s one way, 8 sub-domains.
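As a rough plausibility check of the 118 ms figure (this estimate is not from the slides; it assumes double-precision fields on the 520 x 350 x 60 COSMO-2 grid, split over the 8 sub-domains and copied to and from the GPU once per time step):

  $10 \times (520 \times 350 \times 60) \times 8\ \text{B} \approx 0.87\ \text{GB}$
  $2 \times \dfrac{0.87\ \text{GB} / 8}{2\ \text{GB/s}} \approx 0.11\ \text{s}$

which is the same order of magnitude as the measured 118 ms.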

Our strategy: full GPU port
All code which uses grid data fields at every time step is ported to the GPU.
- Input/Setup: kept on CPU, copy to GPU
- Time step loop (Δt):
  - Initialize timestep: OpenACC directives
  - Physics: OpenACC directives
  - Dynamics: C++ rewrite (uses CUDA)
  - Assimilation: OpenACC directives (part on CPU)
  - Halo-update: GPU-GPU communication library (GCL)
  - Diagnostics: OpenACC directives
  - Output: kept on CPU, copy from GPU
- Finalize
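A minimal sketch of this full-port data strategy (not project code; the program and field names are invented): the fields stay resident on the GPU across the whole time loop, and data only moves back to the CPU when output is needed.

  program full_port_sketch
    implicit none
    integer, parameter :: nx = 128, ny = 128, nz = 60, nsteps = 10
    real :: field(nx, ny, nz)
    integer :: i, j, k, step

    field = 0.0                          ! Input/Setup stays on the CPU

    !$acc data copyin(field)
    ! Fields remain on the GPU for the whole time loop.
    do step = 1, nsteps
       !$acc parallel loop collapse(3)
       do k = 1, nz
          do j = 1, ny
             do i = 1, nx
                field(i, j, k) = field(i, j, k) + 1.0   ! stand-in for physics/dynamics
             end do
          end do
       end do
       !$acc end parallel loop

       if (mod(step, 5) == 0) then
          !$acc update host(field)        ! copy back only for output
          print *, 'step', step, ' mean =', sum(field) / size(field)
       end if
    end do
    !$acc end data
  end program full_port_sketch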

The HP2C OPCODE project
- Part of the Swiss High Performance and High Productivity Computing (HP2C) initiative
- Prototype implementation of the COSMO production suite of MeteoSwiss, making aggressive use of GPU technology
- Goal: same time-to-solution on substantially cheaper, GPU-based hardware than today's 144 CPUs with 12 cores each (1728 cores, 1 cabinet Cray XE6)
- The OPCODE prototype code should be running by end 2012
- These developments are part of the priority project POMPA within the COSMO consortium: pre-operational GPU version of COSMO for 2014

MeteoSwiss COSMO-7 and COSMO-2
- IFS @ ECMWF provides lateral boundaries: 4x/day, 16 km, 91 levels
- COSMO-7: 72 h forecast 3x/day, 6.6 km, 60 levels
- COSMO-2: 33 h forecast 8x/day, 2.2 km, 60 levels, 520 x 350 x 60 grid points

COSMO on the demonstrator system
- Demonstrator: a multi-GPU node with a multicore processor
- One MPI task per GPU

Dynamical core refactoring
Dynamics:
- Small group of developers
- Memory bandwidth bound
- Complex stencils (IJK dependencies)
- 60% of runtime, 40 000 lines (18%)
Approach:
- Complete rewrite in C++
- Development of a stencil library
- Target architectures: CPU (x86) and GPU; could be extended to other architectures
- Long-term adaptation of the model
Communication library:
- Requirement: multi-node communications that can be called from the new dynamical core
- New communication library (GCL)
- Can be called from C++ and Fortran code
- Can handle CPU-CPU and GPU-GPU communication
- Allows overlap of communication and computation (see the sketch below)
Note: only single-node results will be presented.
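The GCL interface itself is not shown in the talk, so the sketch below is only a generic illustration of the communication/computation overlap idea, using plain MPI and invented names (1-D decomposition in j, one halo row per side): post non-blocking halo exchanges, update the interior while the messages are in flight, then finish the rows that need halo data.

  subroutine halo_overlap_step(field, nx, ny, south, north, comm)
    use mpi
    implicit none
    integer, intent(in) :: nx, ny, south, north, comm
    real, intent(inout) :: field(nx, 0:ny+1)   ! rows 0 and ny+1 are halos
    integer :: req(4), ierr, j

    ! 1) Start the non-blocking halo exchange with both neighbours
    call MPI_Irecv(field(1, 0),    nx, MPI_REAL, south, 0, comm, req(1), ierr)
    call MPI_Irecv(field(1, ny+1), nx, MPI_REAL, north, 1, comm, req(2), ierr)
    call MPI_Isend(field(1, 1),    nx, MPI_REAL, south, 1, comm, req(3), ierr)
    call MPI_Isend(field(1, ny),   nx, MPI_REAL, north, 0, comm, req(4), ierr)

    ! 2) Compute interior rows that do not touch the halo while messages are in flight
    do j = 2, ny - 1
       field(:, j) = 0.99 * field(:, j)        ! stand-in for the real stencil work
    end do

    ! 3) Wait for the halos, then update the two rows that need them
    call MPI_Waitall(4, req, MPI_STATUSES_IGNORE, ierr)
    field(:, 1)  = 0.5 * (field(:, 0)    + field(:, 2))
    field(:, ny) = 0.5 * (field(:, ny+1) + field(:, ny-1))
  end subroutine halo_overlap_step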

Stencil computations
- COSMO uses finite differences on a structured grid
- Stencil computation is the dominating algorithmic motif within the dycore
- Stencil definition: a kernel updating array elements according to a fixed access pattern
- Example, 2D Laplacian:

    lap(i,j,k) = -4.0 * data(i,j,k) + data(i+1,j,k) + data(i-1,j,k) + data(i,j+1,k) + data(i,j-1,k);
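For completeness, a self-contained Fortran version of this example (the loop bounds and field sizes are chosen for illustration; the field names data and lap follow the slide), applying the 2D Laplacian at every interior point of a 3D field:

  program laplacian_example
    implicit none
    integer, parameter :: ie = 64, je = 64, ke = 60
    real :: data(ie, je, ke), lap(ie, je, ke)
    integer :: i, j, k

    call random_number(data)
    lap = 0.0

    ! Same access pattern at every (i,j,k): the centre point and its four horizontal neighbours
    do k = 1, ke
       do j = 2, je - 1
          do i = 2, ie - 1
             lap(i, j, k) = -4.0 * data(i, j, k)               &
                          + data(i+1, j, k) + data(i-1, j, k)  &
                          + data(i, j+1, k) + data(i, j-1, k)
          end do
       end do
    end do

    print *, 'lap(2,2,1) =', lap(2, 2, 1)
  end program laplacian_example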

Stencil library development
Motivation:
- Provide a way to implement stencils in a platform-independent way
- Hide complex / hardware-dependent optimizations from the user
- Single source code which is performance portable
Solution retained:
- DSEL: Domain Specific Embedded Language
- C++ library using template meta-programming
- Optimized back-ends for GPU and CPU:

                              CPU       GPU
    Storage order (Fortran)   KIJ       IJK
    Parallelization           OpenMP    CUDA

Stencil code concepts

  ! loop-logic
  DO k = 1, ke
    DO j = jstart, jend
      DO i = istart, iend
        ! update-function / stencil
        lap(i,j,k) = -4.0 * data(i,j,k)              &
                   + data(i+1,j,k) + data(i-1,j,k)   &
                   + data(i,j+1,k) + data(i,j-1,k)
      ENDDO
    ENDDO
  ENDDO

A stencil definition consists of two parts:
- Loop-logic: defines the stencil application domain and execution order (provided by the DSEL)
- Update-function: expression evaluated at every location (written by the user)
While the loop-logic is platform dependent, the update-function is not: treat the two separately.

Programming the new dycore

  // Update-function
  enum { data, lap };

  template<typename TEnv>
  struct Lap
  {
    STENCIL_STAGE(TEnv)

    STAGE_PARAMETER(FullDomain, data)
    STAGE_PARAMETER(FullDomain, lap)

    static void Do(Context ctx, FullDomain)
    {
      ctx[lap::center()] = -4.0 * ctx[data::center()]
                         + ctx[data::at(iplus1)] + ctx[data::at(iminus1)]
                         + ctx[data::at(jplus1)] + ctx[data::at(jminus1)];
    }
  };

  // Setup
  IJKRealField lapfield, datafield;
  Stencil stencil;

  StencilCompiler::Build(
    stencil,
    "Example",
    calculationdomainsize,
    StencilConfiguration<Real, BlockSize<32,4> >(),
    define_sweep<kloopfulldomain>(
      define_stages(
        StencilStage<Lap, IJRange<cComplete,0,0,0,0> >()
      )
    )
  );

  // Applying the stencil replaces the hand-written DO k / DO j / DO i
  // loop nest shown on the previous slide
  for (int step = 0; step < 10; ++step)
  {
    stencil.apply();
  }

Dynamics, single-node performance
Test domain 128x128x60. CPU: 16-core Interlagos; GPU: X2090.
CPU / OpenMP backend:
- Factor 1.6x - 1.7x faster than the COSMO dycore
- No explicit use of vector instructions yet (a further 10% up to 30% improvement)
GPU / CUDA backend:
- A Tesla M2090 (GPU with 150 GB/s memory bandwidth) is roughly a factor 2.6x faster than Interlagos (CPU with 52 GB/s memory bandwidth)
- Ongoing performance optimizations
[Bar chart: speedup of CPU Fortran (Interlagos), CPU HP2C (Interlagos) and GPU HP2C (Tesla M2090); scale 0 to 4.5]

Pros and cons of the rewrite approach
Pros:
- Performance and portability
- Better separation of implementation strategy and algorithm
- Single source code
- The library suggests / enforces certain coding conventions and styles
- Flexibility to extend to other architectures
Cons / difficulties:
- This is a big step with respect to the original Fortran code
- Can this be taken over by the main developers of the dycore? (workshops, knowledge transfer)
- Adding support for new hardware platforms requires a deep understanding of the library implementation

Physics and data assimilation port to GPU
Physics:
- Large group of developers; some code may be shared with other models
- Less memory bandwidth bound
- Simpler stencils (K-dependencies)
- 20% of runtime, 43 000 lines (19%)
- GPU port with OpenACC directives
- Optimization of the code to get optimal performance on the GPU
- Most routines have, for the moment, both a GPU and a CPU version
Data assimilation:
- Very large code: 83 000 lines (37%), 1% of runtime
- GPU port with OpenACC directives, only for parts accessing multiple grid data fields
- No code optimization; single source code
- Some parts still computed on the CPU

Directives / compiler choices for OPCODE

  !$acc parallel
  !$acc loop gang vector
  do i = 1, n
     a(i) = b(i) + c(i)
  end do
  !$acc end parallel

OpenACC:
- Open standard, supported by 3 compiler vendors: PGI, Cray, CAPS
- Solution retained for OPCODE (for physics and assimilation)
- PGI: some remaining issues with the compiler
- Cray: the code can be compiled and run, and gives correct results
- CAPS: not investigated yet
PGI proprietary directives:
- Used for the first implementation of the physics
- Translation to OpenACC is not an issue
Being able to test the code with different compilers is essential.

Implementation strategy with directives
- Parallelization: horizontal direction, one thread per vertical column
- Most loop structures unchanged; one only adds directives
- In some parts, loops are restructured to reduce kernel call overheads and benefit from cache reuse:

  Original structure (one kernel per horizontal loop):

  !$acc data present(a, c1, c2)
  ! vertical loop
  do k = 2, nz
     ! work 1
     !$acc parallel loop vector_length(n)
     do ip = 1, nproma
        c2(ip) = c1(ip,k) * a(ip,k-1)
     end do
     !$acc end parallel loop
     ! work 2
     !$acc parallel loop vector_length(n)
     do ip = 1, nproma
        a(ip,k) = c2(ip) * a(ip,k-1)
     end do
     !$acc end parallel loop
  end do
  !$acc end data

  Restructured (vertical loop inside a single kernel):

  !$acc data present(a, c1)
  !$acc parallel loop vector_length(n)
  do ip = 1, nproma
     ! vertical loop
     do k = 2, nz
        ! work 1
        c2 = c1(ip,k) * a(ip,k-1)
        ! work 2
        a(ip,k) = c2 * a(ip,k-1)
     end do
  end do
  !$acc end parallel loop
  !$acc end data

- Remove Fortran automatic arrays in subroutines which are called often (to avoid calls to cudaMalloc)
- Use data regions to avoid CPU-GPU transfers
- Use a profiler to target specific parts which need further optimization: reduce memory usage, replace intermediate arrays with scalars

Lapillonne and Fuhrer, submitted to Parallel Processing Letters.

Ported parametrizations
Currently implemented and tested physics:
- Microphysics (ice scheme)
- Radiation (Ritter-Geleyn)
- Turbulence (Raschendorfer)
- Soil (Terra)
These 4 parametrizations account for 90-95% of the physics time of a typical COSMO-2 run. First meaningful real-case simulations are possible with this reduced set.

Performance results for the physics
Test domain 128x128x60; 16-core Interlagos CPU vs. X2090 GPU.
[Bar chart: GPU speedup over CPU for Microphysics, Radiation, Turbulence, Soil and Total Physics, with "CCE+OpenACC" and "PGI+PGI directives"; scale 0 to 6]
- Overall speedup: 3.6x
- Similar performance with Cray CCE and PGI
- Running the GPU-optimized code on the CPU is about 25% slower: separate sources are kept for time-critical routines

Our experience using directives
- Relatively easy to get the code running
- Useful to port large parts of the code
- Requires some work to get performance: data placement, restructuring, additional optimization. For example, the GPU part of the assimilation is 20% to 50% slower on the GPU than on the CPU
- Having a single source code that runs efficiently on both GPU and CPU (x86) is still an issue

Putting it all together: can we run this?
- The dynamical core is compiled as a library: gcc + nvcc
- Linked with the Fortran part + OpenACC
- So far this only works with Cray CCE
- GPU pointers are passed from Fortran to the C++ library using the host_data directive: no data transfer required
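A minimal sketch of this interoperability pattern (not the actual dycore interface; the routine names dycore_step and call_dycore and their argument lists are invented): inside a host_data region the Fortran array name refers to its device address, which can be passed straight to a CUDA/C++ routine.

  subroutine call_dycore(u, n)
    use iso_c_binding, only: c_int, c_float
    implicit none
    integer, intent(in) :: n
    real(c_float), intent(inout) :: u(n)
    interface
       ! Hypothetical C/C++ entry point: void dycore_step(float* u, int n)
       subroutine dycore_step(u, n) bind(c, name="dycore_step")
         import :: c_int, c_float
         real(c_float), intent(inout) :: u(*)
         integer(c_int), value :: n
       end subroutine dycore_step
    end interface

    ! In the real code the fields are already resident on the GPU;
    ! the data region here just makes the sketch self-contained.
    !$acc data copy(u)
    !$acc host_data use_device(u)
    call dycore_step(u, int(n, c_int))   ! u is passed as its GPU (device) address
    !$acc end host_data
    !$acc end data
  end subroutine call_dycore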

Conclusions
- Our strategy: full GPU port
  - Dynamics: complete rewrite
  - Physics and data assimilation: OpenACC directives
- We could achieve the same time-to-solution as the current operational system with a multi-GPU node having O(10) GPUs: demonstrator system for end 2012

Acknowledgments
- J. Pozanovick and all the CSCS support staff
- R. Ansaloni, Cray
- P. Messmer, NVIDIA
- M. Wolfe, M. Colgrove, PGI