Porting COSMO to Hybrid Architectures


Porting COSMO to Hybrid Architectures
T. Gysi [1], O. Fuhrer [2], C. Osuna [3], X. Lapillonne [3], T. Diamanti [3], B. Cumming [4], T. Schroeder [5], P. Messmer [5], T. Schulthess [4,6,7]
[1] Supercomputing Systems AG, [2] Swiss Federal Office of Meteorology and Climatology, [3] Center for Climate Systems Modeling, ETH Zurich, [4] Swiss National Supercomputing Center (CSCS), [5] NVIDIA Corp., [6] Institute for Theoretical Physics, ETH Zurich, [7] Computer Science and Mathematics Division, ORNL
Programming weather, climate and earth-system models on heterogeneous multi-core platforms, Sept 12-13, 2012, NCAR, Boulder, CO

Why Improve COSMO?
- COSMO: Consortium for Small-Scale Modeling, a limited-area climate model (http://www.cosmo-model.org/)
- Used by 7 weather services and O(50) universities and research institutes
- High CPU usage at CSCS (Swiss National Supercomputing Center): 30 million CPU hours in total, ~50% on a dedicated machine
- Strong desire for improved simulation quality: higher resolution, larger ensemble simulations, increasing model complexity
- Performance improvements are critical!

Need for Higher Resolution in Switzerland
[Figure: simulations at dx = 2 km and dx = 1 km compared with reality]
Resolution is of key importance to increase simulation quality: 2x resolution implies roughly 10x the computational cost (4x more grid columns plus a correspondingly shorter time step, on top of other overheads).

The COSMO Port to Hybrid Architectures is Part of the HP2C Project
- Part of the Swiss HPCN strategy (hardware / infrastructure / software)
- Strong focus on hybrid architectures for real-world applications
- 10 projects from different domains (http://www.hp2c.ch/), e.g. cardiovascular simulation (EPFL), stellar explosions (University of Basel), quantum dynamics (University of Geneva)
- COSMO-CCLM:
  1. Cloud-resolving climate simulations (IPCC AR5)
  2. Adapt existing code (hybrid, I/O)
  3. Aggressive developments (different programming languages, GPUs)

Refactoring Approach
Physics (~20% of runtime):
- Large group of developers; plug-in code from other models
- Less memory-bandwidth bound; simpler stencils (K-dependencies)
- Keep the Fortran source code; GPU port with directives (OpenACC)
Dynamics (~60% of runtime):
- Small group of developers
- Memory-bandwidth bound; complex stencils (IJK-dependencies)
- Aggressive rewrite in C++, with development of a stencil library
- Still a single source code, with multiple library back-ends for x86 / GPU
[Chart: lines of code vs. runtime share of the model components]

Requirements for a Portable Stencil Library

  DO k = 1, ke
    DO j = jstart, jend
      DO i = istart, iend
        lap(i,j,k) = data(i+1,j,k) + data(i-1,j,k) + data(i,j+1,k) + data(i,j-1,k) - 4.0*data(i,j,k)
      ENDDO
    ENDDO
  ENDDO

Such a loop nest has two components:
- Loop-logic: defines the stencil application domain; platform dependent
- Update-function / stencil: the expression evaluated at each location; platform independent
=> Treat the two components separately

Loop-Logic Expressed in a Domain Specific Language (DSL)
- An embedded DSL is defined in C++ using the type system / template metaprogramming
- Code is written as a type; the type is translated into a sequence of operations (DSL compilation) at compile time
- Operation objects ("code fragments") are inserted at compile time (code generation)
- Pre-packaged loop objects are provided for CPU and GPU
- This approach generates the platform-dependent loop-logic
[Diagram: a DSL loop definition is compiled into platform-dependent loop code, assembled from library loop objects such as ApplyBlocks (CUDA / OpenMP) and LoopOverBlock (CUDA / OpenMP)]
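
To make the compile-time selection concrete, here is a minimal, self-contained C++ sketch in the spirit of this approach (a hypothetical illustration, not the actual HP2C stencil library API; the real interface is shown on the next slide): the update-function is a plain functor, and a backend tag type picks the loop object via template specialization.

  #include <cstdio>
  #include <vector>

  // Hypothetical sketch: a backend tag selects the loop object at compile time.
  struct CPUBackend {};   // a CUDABackend would provide its own specialization

  // Minimal 2D field with I as the fastest-varying index.
  struct Field {
      int ni, nj;
      std::vector<double> v;
      Field(int ni_, int nj_) : ni(ni_), nj(nj_), v(ni_ * nj_, 0.0) {}
      double& operator()(int i, int j) { return v[i + j * ni]; }
      double operator()(int i, int j) const { return v[i + j * ni]; }
  };

  // Platform-independent update-function: 5-point Laplacian at (i,j).
  struct LapStage {
      static double apply(const Field& data, int i, int j) {
          return data(i + 1, j) + data(i - 1, j)
               + data(i, j + 1) + data(i, j - 1) - 4.0 * data(i, j);
      }
  };

  // Platform-dependent loop-logic, chosen by specializing on the backend tag.
  template <class Backend> struct LoopOverDomain;

  template <> struct LoopOverDomain<CPUBackend> {
      template <class Stage>
      static void run(const Field& in, Field& out) {
          for (int j = 1; j < in.nj - 1; ++j)       // a CUDA or OpenMP backend
              for (int i = 1; i < in.ni - 1; ++i)   // would supply its own loops
                  out(i, j) = Stage::apply(in, i, j);
      }
  };

  int main() {
      Field in(8, 8), out(8, 8);
      in(4, 4) = 1.0;                               // unit spike
      LoopOverDomain<CPUBackend>::run<LapStage>(in, out);
      std::printf("lap(4,4) = %g\n", out(4, 4));    // prints -4
      return 0;
  }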

Putting it all together

Stencil assembly:

  IJKRealField laplacian, pressure;
  Stencil stencil;
  StencilCompiler::Build(
      stencil,
      "Example",
      calculationDomainSize,
      StencilConfiguration<Real, BlockSize<32,4> >(),
      define_sweep<KLoopFullDomain>(
          define_stages(
              StencilStage<LapStage, IJBoundary<cComplete,0,0,0,0> >()
          )
      )
  );
  stencil.Apply();

Update-function (stencil stage) definition:

  enum { data, lap };

  template<typename TEnv>
  struct LapStage
  {
      STENCIL_STAGE(TEnv)
      STAGE_PARAMETER(FullDomain, data)
      STAGE_PARAMETER(FullDomain, lap)

      static void Do(Context ctx, FullDomain)
      {
          ctx[lap::Center()] = -4.0 * ctx[data::Center()]
              + ctx[data::At(iplus1)] + ctx[data::At(iminus1)]
              + ctx[data::At(jplus1)] + ctx[data::At(jminus1)];
      }
  };

Stencil Library Parallelization
The library supports two levels of parallelism:
- Coarse-grained parallelism (multi-core, shared-memory parallelization): the IJ plane of the domain is split into blocks (block0, block1, block2, block3, ...) and the blocks are distributed to CPU cores; no synchronization or consistency is required between blocks
- Fine-grained parallelism (vectorization): a block is updated on a single core using lightweight threads / vectors; synchronization and consistency are required within a block
This is close to the CUDA programming model and should be a good match for other platforms as well.
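
As a rough illustration of the two levels (a simplified sketch assuming a 2D field with I as the fastest-varying index, not the library's actual OpenMP back-end), the coarse level distributes IJ blocks to cores while the fine level is a contiguous, vectorizable loop over I inside each block:

  #include <algorithm>
  #include <vector>

  // Sketch only: coarse-grained = blocks of the IJ plane distributed to cores,
  // fine-grained = vectorizable inner loop over I within each block.
  void apply_laplacian(const std::vector<double>& in, std::vector<double>& out,
                       int ni, int nj, int blockI, int blockJ) {
      auto idx = [ni](int i, int j) { return i + j * ni; };  // I fastest-varying

      // Coarse level: blocks are independent, so no synchronization is needed.
      #pragma omp parallel for collapse(2)
      for (int jb = 1; jb < nj - 1; jb += blockJ) {
          for (int ib = 1; ib < ni - 1; ib += blockI) {
              const int jEnd = std::min(jb + blockJ, nj - 1);
              const int iEnd = std::min(ib + blockI, ni - 1);
              for (int j = jb; j < jEnd; ++j) {
                  // Fine level: contiguous in memory; maps to SIMD lanes on the
                  // CPU or to the threads of a CUDA block on the GPU.
                  #pragma omp simd
                  for (int i = ib; i < iEnd; ++i) {
                      out[idx(i, j)] = in[idx(i + 1, j)] + in[idx(i - 1, j)]
                                     + in[idx(i, j + 1)] + in[idx(i, j - 1)]
                                     - 4.0 * in[idx(i, j)];
                  }
              }
          }
      }
  }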

GPU Backend Overview
- Storage: IJK storage order, giving coalesced reads in the I direction
- Parallelization: parallelize in the IJ dimension; the library blocks are mapped to CUDA blocks, and block-boundary elements are updated using additional warps
- Data field indexing: data field pointers and strides are stored in shared memory, indexes are kept in registers
[Diagram: the CUDA grid splits the IJ plane into blocks; each block includes a boundary handled by additional boundary warps]
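
The storage choice can be summarized with plain index arithmetic (a simplified host-side sketch, not the actual CUDA back-end code): with IJK storage the I stride is 1, so CUDA threads that handle neighbouring I positions access neighbouring addresses, which is what makes the reads coalesced.

  #include <cassert>
  #include <cstddef>

  // Simplified IJK indexing sketch (illustration only, not the back-end code).
  struct FieldStrides {
      std::size_t iStride;  // 1 for IJK storage order
      std::size_t jStride;  // ni (possibly padded for alignment)
      std::size_t kStride;  // ni * nj
  };

  inline std::size_t offset(const FieldStrides& s, int i, int j, int k) {
      // In the CUDA back-end the field pointers and strides are kept in shared
      // memory and the running index in a register; here it is plain arithmetic.
      return i * s.iStride + j * s.jStride + k * s.kStride;
  }

  int main() {
      const FieldStrides s{1, 128, 128 * 112};  // e.g. a 128x112xKE field
      // Neighbouring I positions are one element apart, so consecutive CUDA
      // threads of a block handling consecutive i produce coalesced accesses.
      assert(offset(s, 5, 3, 2) + 1 == offset(s, 6, 3, 2));
      return 0;
  }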

HP2C Dycore Performance
- CPU / OpenMP backend: factor 1.6x to 1.7x faster than the standard COSMO implementation (here without SSE support; another 10% to 30% improvement is expected)
- GPU / CUDA backend: a Tesla M2090 (150 GB/s with ECC enabled) is roughly a factor 2.6x faster than Interlagos (16-core Opteron CPU with 52 GB/s)
- Performance optimization is ongoing
[Chart: speedup of GPU HP2C (Tesla C2090), CPU HP2C (Interlagos) and CPU Fortran (Interlagos); axis 0.0 to 4.5]

Acceleration of Physical Parametrizations: Current State
- Parametrizations: processes not described by the dynamics, such as radiation or turbulence; they account for about 20 to 25% of total runtime
- GPU versions of the parametrizations have been implemented in COSMO
- Currently implemented and tested physics: microphysics (Reinhardt and Seifert, 2006), radiation (Ritter and Geleyn, 1992), turbulence (Raschendorfer, 2001), soil (Heise, 1991)
- These account for 90-95% of the physics in a typical COSMO-2 run
- Only the options used for operational runs are supported; unsupported features have been documented

Directives / Compiler Choices for OPCODE
- OpenACC: open standard supported by 3 compiler vendors (PGI, Cray, CAPS); the directives of choice for the final OPCODE version
- CAPS: future approach
- PGI proprietary directives: enabled the port of all kernels (some workarounds required); used for the first implementation of the physics; translation to OpenACC is relatively straightforward
=> Testing the code with different compilers can be very helpful!

  !$acc parallel loop vector_length(n)
  do i=1,n
    a(i)=b(i)+c(i)
  end do
  !$acc end parallel loop

Implementation in COSMO
- Change to a block data structure inside the physics: f(i,j,k) -> f_b(nproma,ke), with nproma = istartpar x iendpar / nblock (nblock = 1 for a GPU run)
- The physics loop is restructured to iterate over blocks; inside the physics schemes the data is in block form f_b(nproma,ke)
- The required data reside on the GPU; all operations on grid data are computed on the GPU, and the physics timing region covers this block loop

  transfer from CPU to GPU (ijk data f(i,j,k))
  ! start block loop
  do ib=1,nblock
    call copy_to_block
    call organize_gscp
    call organize_radiation
    call organize_turbulence
    call organize_soil
    call copy_back
  end do
  transfer back from GPU to CPU (ijk data f(i,j,k))

Porting Strategy for Parametrizations
- Pencil parallelization: parallelize over the horizontal direction, with 1 thread per vertical column
- Most loop structures are unchanged; one only adds directives
- In some parts, loops are restructured to reduce kernel-call overheads and profit from cache reuse; NEC vector optimizations are removed

Before restructuring (one kernel launch per vertical level):

  !$acc data present(a,c1,c2)
  ! vertical loop
  do k=2,nz
    ! work 1
    !$acc parallel loop vector_length(n)
    do ip=1,nproma
      c2(ip)=c1(ip,k)*a(ip,k-1)
    end do
    !$acc end parallel loop
    ! work 2
    !$acc parallel loop vector_length(n)
    do ip=1,nproma
      a(ip,k)=c2(ip)*a(ip,k-1)
    end do
    !$acc end parallel loop
  end do
  !$acc end data

After restructuring (a single kernel, with the vertical loop inside each thread):

  !$acc data present(a,c1)
  !$acc parallel loop vector_length(n)
  do ip=1,nproma
    ! vertical loop
    do k=2,nz
      ! work 1
      c2=c1(ip,k)*a(ip,k-1)
      ! work 2
      a(ip,k)=c2*a(ip,k-1)
    end do
  end do
  !$acc end parallel loop
  !$acc end data

- Fortran automatic arrays are removed in subroutines that are called often (to avoid calls to cudaMalloc)
- Data regions are used to avoid CPU-GPU transfers
- A profiler is used to target the specific parts that need further optimization: reduce memory usage, replace intermediate arrays with scalars

GPU/CPU Comparison
- CPU: original physics on 16 cores (Interlagos) using MPI; physics time 42.4 s (average, without communication)
- GPU: block physics on 1 core + 1 GPU (X2090); physics time 12.5 s
- Benchmark: 128x112x60 subdomain, 1 h simulation with microphysics, radiation, turbulence and soil
- CPU and GPU results agree to within roundoff error

GPU/CPU Comparison, Detailed Timing

  Component        Original physics, CPU time (s)   Block physics, GPU time (s)   Speedup
                   16 cores (Interlagos)            1 core + 1 GPU (X2090)
  Microphysics     15.7 (37%)                        2.3 (18%)                    6.8x
  Radiation        11.1 (26%)                        2.6 (20%)                    4.3x
  Turbulence       14.4 (34%)                        5.1 (41%)                    2.8x
  Soil              1.2  (3%)                        0.5  (4%)                    2.4x
  Copy ijk/block      -                              2.0 (16%)                      -
  Total physics    42.4 (100%)                      12.5 (100%)                   3.4x

Test subdomain 128x112x60, 1 h simulation.
- Currently, running the block-physics code on the CPU (i.e. ignoring the directives) is slower (total physics = 53 s); this is due to the GPU loop-reordering optimizations, not to the block structure.
- Having a single source code that runs efficiently on x86 CPUs (i.e. excluding NEC) and GPUs will require further work.
- The GPU code runs 7% faster on CASTOR (C2090).

Summary and Next Steps
- Dycore ported using a portable stencil library and DSL; speedup of ~4x vs. the original code
- Physics ported using directives; speedup of ~3.4x vs. the original code
- The dycore speedup for relevant domain sizes is retained on K20 / Sandy Bridge
- Ongoing: combining the dycore, physics and messaging layer

Thank you! pmessmer@nvidia.com

Need for Higher Resolution in Switzerland
[Figure: Switzerland as resolved at 35 km, 8.8 km and 2.2 km grid spacing]
Resolution is of key importance to increase simulation quality: 2x resolution implies roughly 10x the computational cost.