CLAW FORTRAN Compiler source-to-source translation for performance portability

Similar documents
CLAW FORTRAN Compiler Abstractions for Weather and Climate Models

Deutscher Wetterdienst

Deutscher Wetterdienst

The CLAW project. Valentin Clément, Xavier Lapillonne. CLAW provides high-level Abstractions for Weather and climate models

Adapting Numerical Weather Prediction codes to heterogeneous architectures: porting the COSMO model to GPUs

Porting COSMO to Hybrid Architectures

Federal Department of Home Affairs FDHA Federal Office of Meteorology and Climatology MeteoSwiss. PP POMPA status.

NVIDIA Update and Directions on GPU Acceleration for Earth System Models

Using OpenACC in IFS Physics Cloud Scheme (CLOUDSC) Sami Saarinen ECMWF Basic GPU Training Sept 16-17, 2015

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017

GPU Developments for the NEMO Model. Stan Posey, HPC Program Manager, ESM Domain, NVIDIA (HQ), Santa Clara, CA, USA

Physical parametrizations and OpenACC directives in COSMO

Porting the ICON Non-hydrostatic Dynamics and Physics to GPUs

Using EasyBuild and Continuous Integration for Deploying Scientific Applications on Large Scale Production Systems

Accelerating Financial Applications on the GPU

Stan Posey, NVIDIA, Santa Clara, CA, USA

GPUs and Emerging Architectures

Experiences with CUDA & OpenACC from porting ACME to GPUs

GPU Developments in Atmospheric Sciences

Can Accelerators Really Accelerate Harmonie?

The challenges of new, efficient computer architectures, and how they can be met with a scalable software development strategy.! Thomas C.

S Comparing OpenACC 2.5 and OpenMP 4.5

An OpenACC GPU adaptation of the IFS cloud microphysics scheme

OpenACC (Open Accelerators - Introduced in 2012)

Accelerator programming with OpenACC

A Simulation of Global Atmosphere Model NICAM on TSUBAME 2.5 Using OpenACC

Our Workshop Environment

PLAN-E Workshop Switzerland. Welcome! September 8, 2016

Carlos Osuna. Meteoswiss.

Optimizing Weather Model Radiative Transfer Physics for the Many Integrated Core and GPGPU Architectures

IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor

COMP Parallel Computing. Programming Accelerators using Directives

OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

Running the FIM and NIM Weather Models on GPUs

Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015

RAMSES on the GPU: An OpenACC-Based Approach

Miwako TSUJI XcalableMP

CSCS Proposal writing webinar Technical review. 12th April 2015 CSCS

INTRODUCTION TO COMPILER DIRECTIVES WITH OPENACC

ECE 574 Cluster Computing Lecture 18

OpenStaPLE, an OpenACC Lattice QCD Application

Our Workshop Environment

Status of the COSMO GPU version

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

An Extension of XcalableMP PGAS Lanaguage for Multi-node GPU Clusters

INTRODUCTION TO ACCELERATED COMPUTING WITH OPENACC. Jeff Larkin, NVIDIA Developer Technologies

Our Workshop Environment

The Design and Implementation of OpenMP 4.5 and OpenACC Backends for the RAJA C++ Performance Portability Layer

OpenACC Standard. Credits 19/07/ OpenACC, Directives for Accelerators, Nvidia Slideware

Comparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015

Porting the ICON Non-hydrostatic Dynamics to GPUs

PP POMPA (WG6) News and Highlights. Oliver Fuhrer (MeteoSwiss) and the whole POMPA project team. COSMO GM13, Sibiu

Piz Daint: Application driven co-design of a supercomputer based on Cray s adaptive system design

arxiv: v1 [hep-lat] 12 Nov 2013

Illinois Proposal Considerations Greg Bauer

OpenACC Course Lecture 1: Introduction to OpenACC September 2015

A case study of performance portability with OpenMP 4.5

Improving the Energy- and Time-to-solution of COSMO-ART

OpenACC Fundamentals. Steve Abbott November 13, 2016

GPU Acceleration of the Longwave Rapid Radiative Transfer Model in WRF using CUDA Fortran. G. Ruetsch, M. Fatica, E. Phillips, N.

OpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016

Lecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators

Optimised all-to-all communication on multicore architectures applied to FFTs with pencil decomposition

OpenACC compiling and performance tips. May 3, 2013

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer

Progress on GPU Parallelization of the NIM Prototype Numerical Weather Prediction Dynamical Core

Programming paradigms for GPU devices

Evaluating the Performance and Energy Efficiency of the COSMO-ART Model System

GPU. OpenMP. OMPCUDA OpenMP. forall. Omni CUDA 3) Global Memory OMPCUDA. GPU Thread. Block GPU Thread. Vol.2012-HPC-133 No.

OpenACC Course. Office Hour #2 Q&A

Accelerating Harmonie with GPUs (or MICs)

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory

KernelGen a toolchain for automatic GPU-centric applications porting. Nicolas Lihogrud Dmitry Mikushin Andrew Adinets

COSMO Dynamical Core Redesign Tobias Gysi David Müller Boulder,

Portability of OpenMP Offload Directives Jeff Larkin, OpenMP Booth Talk SC17

Programming NVIDIA GPUs with OpenACC Directives

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices

The Rise of Open Programming Frameworks. JC BARATAULT IWOCL May 2015

Early Experiences with the OpenMP Accelerator Model

Using GPUs for ICON: An MPI and OpenACC Implementation

PERFORMANCE PORTABILITY WITH OPENACC. Jeff Larkin, NVIDIA, November 2015

Lattice Simulations using OpenACC compilers. Pushan Majumdar (Indian Association for the Cultivation of Science, Kolkata)

CPU+GPU Programming for ASUCA. Production Weather Model

Directive-based Programming for Highly-scalable Nodes

OPENMP GPU OFFLOAD IN FLANG AND LLVM. Guray Ozen, Simone Atzeni, Michael Wolfe Annemarie Southwell, Gary Klimowicz

CURRENT STATUS OF THE PROJECT TO ENABLE GAUSSIAN 09 ON GPGPUS

Dynamical Core Rewrite

Advanced OpenACC. Steve Abbott November 17, 2017

Trends in HPC (hardware complexity and software challenges)

OpenACC 2.6 Proposed Features

Optimising the Mantevo benchmark suite for multi- and many-core architectures

Our Workshop Environment

GPU Consideration for Next Generation Weather (and Climate) Simulations

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer

OpenMP 4.0. Mark Bull, EPCC

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh

Lecture 1: Gentle Introduction to GPUs

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

OpenACC Fundamentals. Steve Abbott November 15, 2017

Transcription:

CLAW FORTRAN Compiler source-to-source translation for performance portability XcalableMP Workshop, Akihabara, Tokyo, Japan October 31, 2017 Valentin Clement valentin.clement@env.ethz.ch Image: NASA

Summary Performance portability problem in COSMO Single source code Portability Performance Portability of code CLAW FORTRAN Compiler Single column abstraction Future project Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 2

GPU machine in Switzerland - Piz Kesch (MeteoSwiss) Production + R&D Each rack is composed of 12 Compute nodes: 2x Intel Haswell E5-2690v3 2.6 GHz 12-core CPUs (total of 24 CPUs) 256 GB 2133 Mhz DDR4 (total of 3TB) 8x Dual NVIDIA TESLA K80 GPU (total of 96 cards - 192 GPUs) Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 3

GPU machine in Switzerland - Piz Daint (CSCS) Cray XC40/XC50 Intel Xeon E5-2690 v3 2.60GHz, 12 cores, 64GB RAM NVIDIA Tesla P100 Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 4

Performance portability problem - COSMO Radiation 1.2-30% Execution time [s] 0.9 0.6 0.3-35% 0 Execution on CPU Haswell E5-2690v3 12 cores Code restructured for CPU Code restructured for GPU with OpenACC Execution on GPU 1/2 NVIDIA Tesla K80 Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 5

Performance portability problem - COSMO Radiation CPU structure DO k=1,nz CALL fct() DO j=1,nproma! 1st loop body DO j=1,nproma! 2nd loop body DO j=1,nproma! 3rd loop body GPU structure!$acc parallel loop DO j=1,nproma!$acc loop DO k=1,nz CALL fct()! 1st loop body! 2nd loop body! 3rd loop body!$acc end parallel Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 6

How to keep a single source code for everyone Massive code base (200 000 to >1mio LOC) Several architecture specific optimization survive Most of these code base are CPU optimized Not suited for next generation architecture Not suited for massive parallelism Few or no modularity Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 7

What kind of code base are we dealing with? Global/local area weather forecast model >10 around the world Monster FORTRAN 77-2008 monolithic code Without much modularity So far we investigate: COSMO (Local area model consortium) - Several institution ICON - DWD (German Weather Agency) - Will replace COSMO IFS Current Cycle + FVM - ECMWF - Member state usage Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 8

What kind of code base are we dealing with? Example of three code base we investigated so far: COSMO ICON IFS Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 9

COSMO Mode - loc Climate and local area model used by Germany, Switzerland, Russia -------------------------------------------------------------------------------- Language files blank comment code -------------------------------------------------------------------------------- Fortran 90 173 53998 109381 211711 C/C++ Header 148 5595 11827 29888 C++ 121 5050 6189 26580 Python 37 1454 1444 5764 Bourne Again Shell 17 246 381 3206 Bourne Shell 33 544 594 2349 XML 11 272 193 2143 CMake 9 103 98 793 make 1 36 27 230 CUDA 58 4 0 58 -------------------------------------------------------------------------------- SUM: 620 68232 130684 286710 -------------------------------------------------------------------------------- Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 10

ECMWF IFS - loc European Centre for Medium-range weather forecasts - Global Model ------------------------------------------------------------------------------- Language files blank comment code ------------------------------------------------------------------------------- Fortran 90 4003 201338 300427 737289 C/C++ Header 480 846 0 13041 Perl 1 89 13 240 ------------------------------------------------------------------------------- SUM: 4484 202273 300440 750570 -------------------------------------------------------------------------------- *only source without external modules Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 11

DWD ICON - loc New German Global model with option to used it as local area model --------------------------------------------------------------------------------------- Language files blank comment code --------------------------------------------------------------------------------------- Fortran 90 822 99802 144962 447356 C 219 43854 30991 150781 HTML 307 449 15415 94940 Fortran 77 463 294 113285 64061 Java 95 2685 4335 11605 C/C++ Header 106 2194 8359 8332 Python 43 2163 2425 7656 --------------------------------------------------------------------------------------- SUM: 2599 174509 346197 931446 --------------------------------------------------------------------------------------- Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 12

Portability of code Current code base Optimized for one architecture Often with old optimizations still in place Not really modular Global fields injected everywhere Future? Modular standalone package that can be plug In different model For different architecture Abstract data layout where it can be done Abstract model specific data, everything passed as arguments. Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 13

Performance portability - current code Loop structure better for CPUs DO ilev = 1, nlay DO icol = 1, ncol tau_loc(icol,ilev) = max(tau(icol,ilev,igpt) trans(icol,ilev) = exp(-tau_loc(icol,ilev)) DO ilev = nlay, 1, -1 DO icol = 1, ncol radn_dn(icol,ilev,igpt) = trans(icol,ilev) * radn_dn(icol,ilev+1,igpt) DO ilev = 2, nlay + 1 DO icol = 1, ncol radn_up(icol,ilev,igpt) = trans(icol,ilev-1) * radn_up(icol,ilev-1,igpt) ilev -> dependency icol -> no dependency nlay size ~= 100 ncol size >= 10 000 Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 14

Performance portability - current code Sometimes using nproma block optimization for better vectorization!$omp parallel default(shared) DO igpt = 1, ngptot, nproma CALL physical_parameteriztaion( )! Code from previous silde!$omp end parallel Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 15

Performance portability - GPU structured code Loop structure better for GPUs DO icol = 1, ncol DO ilev = 1, nlay tau_loc(icol,ilev) = max(tau(icol,ilev,igpt) trans(icol,ilev) = exp(-tau_loc(icol,ilev)) DO ilev = nlay, 1, -1 radn_dn(icol,ilev,igpt) = trans(icol,ilev) * radn_dn(icol,ilev+1,igpt) DO ilev = 2, nlay + 1 radn_up(icol,ilev,igpt) = trans(icol,ilev-1) * radn_up(icol,ilev-1,igpt) ilev -> dependency icol -> no dependency nlay size ~= 100 ncol size >= 10 000 Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 16

Performance portability - next architecture - What is the best loop structure/data layout for next architecture? - Do we want to rewrite the code each time? - Do we know exactly which architecture we will run on?? Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 17

What is the CLAW FORTRAN Compiler? Source-to-source translation for FORTRAN code Based on XcodeML/F IR Using OMNI Compiler front-end and back-end Contribution to it via GitHub Transformation of AST Different transformation applied based on target Promotion of scalar and arrays Insertion of iteration Insertion of OpenACC and OpenMP directives Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 18

CLAW FORTRAN Compiler under the hood Based on the OMNI Compiler FORTRAN front-end & back-end Source-to-source translator Open source under the BSD license Available on GitHub with the specifications.f90.f90 FPP F_Front CX2X F_Back CLAWFC Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 19

Single column abstraction Separation of concerns Domain scientists focus on their problem (1 column, 1 box) CLAW compiler produce code for each target and directive languages Achieve modularity Standalone physical parameter Modular from model specificity Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 20

RRTMGP Example - A nice modular code CPU structured F2003 radiation code From Robert Pincus and al. from AER University of Colorado Compute intensive part are well located in kernel module. Code is non-the-less CPU structured with horizontal loop as the inner most in every iteration. Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 21

RRTMGP Example - original code - CPU structured Loop over spectral intervals Loop over vertical dimension Loop over horizontal dimension SUBROUTINE lw_solver(ngpt, nlay, tau, )! DECLARATION PART OMITTED DO igpt = 1, ngpt DO ilev = 1, nlay DO icol = 1, ncol tau_loc(icol,ilev) = max(tau(icol,ilev,igpt) trans(icol,ilev) = exp(-tau_loc(icol,ilev)) DO ilev = nlay, 1, -1 DO icol = 1, ncol radn_dn(icol,ilev,igpt) = trans(icol,ilev) * radn_dn(icol,ilev+1,igpt) DO ilev = 2, nlay + 1 DO icol = 1, ncol radn_up(icol,ilev,igpt) = trans(icol,ilev-1) * radn_up(icol,ilev-1,igpt) radn_up(:,:,:) = 2._wp * pi * quad_wt * radn_up(:,:,:) radn_dn(:,:,:) = 2._wp * pi * quad_wt * radn_dn(:,:,:) END SUBROUTINE lw_solver Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 22

RRTMGP Example - Single column abstraction SUBROUTINE lw_solver(ngpt, nlay, tau, ) Only dependency on these iteration spaces! DECLARATION DECL: Fields PART don t OMITTED have the horizontal dimension (demotion) DO igpt = 1, ngpt DO ilev = 1, nlay tau_loc(ilev) = max(tau(ilev,igpt) trans(ilev) = exp(-tau_loc(ilev)) DO ilev = nlay, 1, -1 radn_dn(ilev,igpt) = trans(ilev) * radn_dn(ilev+1,igpt) DO ilev = 2, nlay + 1 radn_up(ilev,igpt) = trans(ilev-1) * radn_up(ilev-1,igpt) radn_up(:,:) = 2._wp * pi * quad_wt * radn_up(:,:) radn_dn(:,:) = 2._wp * pi * quad_wt * radn_dn(:,:) END SUBROUTINE lw_solver Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 23

RRTMGP Example - CLAW code Algorithm for one column only SUBROUTINE lw_solver(ngpt, nlay, tau, )!$claw parallelize! model dimension info located in config DO igpt = 1, ngpt DO ilev = 1, nlay tau_loc(ilev) = max(tau(ilev,igpt) trans(ilev) = exp(-tau_loc(ilev)) DO ilev = nlay, 1, -1 radn_dn(ilev,igpt) = trans(ilev) * radn_dn(ilev+1,igpt) DO ilev = 2, nlay + 1 radn_up(ilev,igpt) = trans(ilev-1) * radn_up(ilev-1,igpt) radn_up(:,:) = 2._wp * pi * quad_wt * radn_up(:,:) radn_dn(:,:) = 2._wp * pi * quad_wt * radn_dn(:,:) END SUBROUTINE lw_solver Dependency on the vertical dimension only Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 24

RRTMGP Example - CLAW transformation Original code (Architecture agnostic) CLAWFC.f90 A single source code Specify a target architecture for the transformation Specify a compiler directives language to be added CPU OpenMP.f90 MIC OpenMP.f90 GPU OpenACC.f90 Automatically transformed code clawfc --directive=openacc --target=gpu -o mo_lw_solver.acc.f90 mo_lw_solver.f90 clawfc --directive=openmp --target=cpu -o mo_lw_solver.omp.f90 mo_lw_solver.f90 clawfc --directive=openmp --target=mic -o mo_lw_solver.mic.f90 mo_lw_solver.f90 Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 25

CLAW - One column - OpenACC - local array strategy Data analysis for generation of OpenACC directives Potentially collapsing loops Generate data transfer if wanted Adapt data layout Promotion of scalar and array fields with model dimensions Detect unsupported statements for OpenACC Insertion of do statements to iterate of new dimensions Insertion of directives (OpenMP/OpenACC) Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 26

RRTMGP Example - GPU w/ OpenACC SUBROUTINE lw_solver(ngpt, nlay, tau, )! DECL: Fields promoted accordingly to usage!$acc data present( )!$acc parallel!$acc loop gang vector private( ) collapse(2) DO icol = 1, ncol, 1 DO igpt = 1, ngpt, 1!$acc loop seq DO ilev = 1, nlay, 1 tau_loc(ilev) = max(tau(icol,ilev,igpt) trans(ilev) = exp(-tau_loc(ilev))!$acc loop seq DO ilev = nlay, 1, (-1) radn_dn(icol,ilev,igpt) = trans(ilev) * radn_dn(icol,ilev+1,igpt)!$acc loop seq DO ilev = 2, nlay + 1, 1 radn_up(icol,ilev,igpt) = trans(ilev-1)*radn_up(icol,ilev-1,igpt)!$acc loop seq DO igpt = 1, ngpt, 1!$acc loop seq DO ilev = 1, nlay + 1, 1 radn_up(icol,igpt,ilev) = 2._wp * pi * quad_wt * radn_up(icol,igpt,ilev) radn_dn(icol,igpt,ilev) = 2._wp * pi * quad_wt * radn_dn(icol,igpt,ilev)!$acc end parallel!$acc end data END SUBROUTINE lw_solver Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 27

CLAW - One column - OpenACC - local array strategy Example of different strategy easy to test with an automatize workflow: 1.Privatize local arrays Make local arrays private (unsupported for allocatable arrays) 2.Promote arrays Reduce allocation overhead Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 28

RRTMGP lw_solver - Original vs. CLAW CPU/OpenMP RRTMGP lw_solver comparison of different kernel version / Domain size: 100x100x42 Piz Kesch (Haswell E5-2690v3 12 cores vs. 1/2 NVIDIA Tesla K80) PGI Reference: original source code on 1-core 8 8.1x 6 5 Speedup 3 2 0 1.0x 0.5x Original code One column abstraction CLAW CPU/OpenMP (12cores) Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 29

RRTMGP lw_solver - CLAW CPU vs. CLAW GPU Comparison of different kernel version / Domain Size100x100x42 Piz Kesch (Haswell E5-2690v3 12 cores vs. 1/2 NVIDIA Tesla K80) PGI Reference: CLAW CPU/OpenMP 12-cores 6 6.1x 5 4 Speedup 2 1 1.0x 0 CLAW CPU/OpenMP (12cores) CLAW GPU/OpenACC (0.5K80) Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 30

RRTMGP lw_solver - CLAW vs. GridTools 6 Comparison of different kernel version / Domain Size100x100x42 Piz Kesch (Haswell E5-2690v3 12 cores vs. 1/2 NVIDIA Tesla K80) PGI Reference: CLAW CPU/OpenMP 12-cores 5 4 CLAW (Cray 8.4.4) CLAW (PGI 16.3) GridTools (GNU + NVCC) 4.6x 5.0x Speedup 2 3.3x 1 1.0x 0 0.6x 0.5x CLAW (Cray 8.4.4) CLAW (PGI 16.3) GridTools (GNU + NVCC) CLAW (Cray 8.4.4) CLAW (PGI 16.3) GridTools (GNU + NVCC) Haswell E5-2690v3 12 cores 1/2 NVIDIA Telse K80 Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 31

IFS-CloudSC - one column version CloudMircophysics Scheme Take less than a day to create a one column version Can play with it and apply different strategy OpenACC privatization of local arrays OpenACC promotion of local arrays Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 32

IFS-CloudSC - one column version: early results Comparison of different kernel version / Domain Size: 16000x137 Piz Daint (Haswell E5-2690v3 12 cores vs. NVIDIA Tesla P100) Cray 8.6.1 Reference: OpenMP 12-cores 3 2.25 Speedup 1.5 2.6x 2.9x 0.75 1x 0 OpenMP / Block CLAW GPU/OpenACC (private) CLAW GPU/OpenACC (promote) Valentin Clement ECMWF / UKMO / STFC 33

Portability with performance ECMWF IFS Operational DyCore data layout: nproma, level, block IFS Physical Parameterizations data layout: nproma, level ECMWF FVM data layout: jlevel, jnode Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 34

Portability with performance ECMWF IFS Operational DyCore data layout: nproma, level, block IFS Physical Parameterizations data layout: nproma, level model cfg IFS Physical Parameterizations data layout: level ECMWF FVM data layout: jlevel, jnode IFS Physical Parameterizations data layout: level, jnode model cfg Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 35

Portability with performance Transformed code might be the same More parallelism on outer loop for GPU is better Independent from data layout Might have to introduce copy Spending small time copying to new data layout might worth in overall performance Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 36

Portability with performance Changement of data layout with copy 100 75 % 50 25 0 Original layout Oprmized layout Oprmized layout w/ copy Different layout Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 37

CLAW collaboration with the OMNI Compiler Project Only viable FORTRAN source-to-source framework Only one currently maintained Very responsive people Accept Pull Requests 61 issues opened as today -> 51 closed 61 PR as today -> 57 closed Open source takes some effort but it is rewarding! Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 38

ENIAC Project (2017-2020) Enabling ICON model on heterogenous architecture Port to OpenACC GridTools for stencil computation (DyCore) Looking at performance portability in FORTRAN code Enhance CLAW FORTRAN Compiler capabilities Move physical parameterization to single column Getting more numbers :-) Apply transformation for x86, XeonPhi and GPUs Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 39

EuroExa Project Machine will be hosted at STFC in UK ARM processor node Loaded with FPGA ECMWF will investigate single column abstraction Specific transformation for ARM processor Mabye automatic offloading to FPGA (FORTRAN to C translation) Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 40

Only a problem of MeteoSwiss? Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 41

Possible future collaboration Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 42

CLAW FORTRAN Compiler - Resources https://github.com/c2sm-rcm/claw-compiler https://github.com/omni-compiler CLAW FORTRAN Compiler developer s guide Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 43

valentin.clement@env.ethz.ch https://github.com/c2sm-rcm/claw-compiler https://github.com/omni-compiler Valentin Clement 5 th XcalableMP Workshop - 31st of October 2017 44