Deutscher Wetterdienst


Porting Operational Models to Multi- and Many-Core Architectures
Ulrich Schättler (Deutscher Wetterdienst), Oliver Fuhrer (MeteoSchweiz), Xavier Lapillonne (MeteoSchweiz)

Contents
- Strong Scalability of the Operational Models
- GPU Developments for the COSMO-Model
- Details of the Porting
Multi-Core Workshop, Boulder

Contributing Scientists
- COSMO-Model: The COSMO Development Team
- COSMO, GPU Developments: Xavier Lapillonne, Oliver Fuhrer, Carlos Osuna, Thomas Schulthess, and many more
- ICON: The ICON Development Team, esp. Florian Prill, Daniel Reinert, Günther Zängl
Thank you all for providing material for this presentation!

Strong Scalability for the Operational Models


The COSMO-Model
- Operational since December 1999
- Original DWD code, now further developed and maintained by the COSMO consortium
- Used at about 30 national weather services (for operational production) and many universities and institutes (for research)
- Also used for climate modelling (COSMO-CLM) and to simulate aerosols and reactive trace gases (COSMO-ART)
- Flat MPI implementation

Old Scalability Results
- From the HP2C Report (June 2010): Performance Analysis and Prototyping of the COSMO Regional Atmospheric Model (Matthew Cordery, et al.)
- Quote from the report: "We note the poor parallel scaling characteristics of COSMO beyond 1000 cores"
- Tests based on COSMO Version 4.10 (from September 2009)
[Figure: parallel speedup of COSMO for a 1-hour simulation on a 1 km grid]

Scalability Tests with COSMO-D2
- New domain: 651 x 716 x 65 grid points
- Test characteristics:
  - a 12 hour forecast should run in 1200 s in ensemble mode and in 400 s in deterministic mode
  - nudgecast run: nudging and latent heat nudging in the first 3h
  - SynSat pictures every 15 minutes
  - amount of output data per hour: 1.6 GByte; asynchronous output is used with 4 or 5 output cores
- Based on COSMO 5.1 (Nov. 2014)
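As a rough cross-check of what these requirements mean, the following Python sketch computes the total number of grid points and how many times faster than real time the forecast has to run. It uses only the numbers quoted above; the function and variable names are illustrative and not taken from the COSMO code.

```python
# Numbers taken from the COSMO-D2 test description.
NX, NY, NZ = 651, 716, 65                 # grid points per dimension
FORECAST_SECONDS = 12 * 3600              # 12 hour forecast
LIMITS = {"ensemble": 1200, "deterministic": 400}  # wall-clock limits in s


def grid_points(nx, ny, nz):
    """Total number of grid points in the 3D domain."""
    return nx * ny * nz


def required_speed_factor(forecast_s, wallclock_s):
    """How many times faster than real time the model must run."""
    return forecast_s / wallclock_s


total = grid_points(NX, NY, NZ)
factors = {
    mode: required_speed_factor(FORECAST_SECONDS, limit)
    for mode, limit in LIMITS.items()
}

print(f"grid points: {total:,}")
for mode, factor in factors.items():
    print(f"{mode} mode: {factor:.0f}x faster than real time")
```

The domain holds about 30 million grid points, and the deterministic requirement is three times as strict as the ensemble one.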

Scalability of COSMO Components (incl. Comm.)
[Figure: scaling curves for Dynamics, Physics, Nudging, LHN, I/O and Total, compared with Ideal]

Is this good or bad?
- Scalability of the COSMO-Model for the COSMO-D2 domain size is reasonably good up to 1600 cores.
- Dynamics and Physics also scale beyond that, up to 6400 cores (which is interesting for climate applications).
- Meeting the operational (DWD / NWP) requirements:
  - for ensemble mode, about 650 cores would be necessary to run a 12 hour forecast in less than 1200 seconds.
  - for deterministic mode, 6400 cores are needed to run in less than 400 seconds.
- But this is one of the more demanding applications. Typical NWP and CLM configurations run well on several hundred or a few thousand cores.
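The two operating points quoted above allow a rough estimate of the parallel efficiency between them. The sketch below treats the time limits as if they were measured run times, which is an assumption (the slide only states "less than" and "about"), so the result is an indication and not a measured value. Function names are illustrative.

```python
def relative_speedup(t_ref, t_new):
    """Speedup of the new run relative to the reference run."""
    return t_ref / t_new


def relative_efficiency(cores_ref, t_ref, cores_new, t_new):
    """Speedup divided by the factor by which the core count grew."""
    return relative_speedup(t_ref, t_new) / (cores_new / cores_ref)


# Operating points from the slide (assumed here to be actual run times).
CORES_ENS, TIME_ENS = 650, 1200.0    # ensemble mode
CORES_DET, TIME_DET = 6400, 400.0    # deterministic mode

speedup = relative_speedup(TIME_ENS, TIME_DET)
efficiency = relative_efficiency(CORES_ENS, TIME_ENS, CORES_DET, TIME_DET)

print(f"speedup: {speedup:.1f}x with {CORES_DET / CORES_ENS:.1f}x the cores")
print(f"relative parallel efficiency: {efficiency:.0%}")
```

Under these assumptions, running three times faster costs almost ten times the cores, which illustrates why the deterministic configuration is called one of the more demanding applications.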


ICON: ICOsahedral Nonhydrostatic Model
- Applications: environmental prediction, weather prediction, climate prediction; global, regional, multi-scale
- New development: replaced GME at DWD in January 2015 as operational global model
- Hybrid implementation: MPI + OpenMP
- Joint development project of DWD and Max-Planck-Institute for Meteorology
  - about 40 active developers from meteorology and computer science
  - ~ 600,000 lines of Fortran code
- Lately joined by KIT to implement ICON-ART (environmental prediction)
- A regional mode is under development (will replace COSMO around 2020)

ICON Parallel Scaling on ECMWF's XC30 ("ccb")
- Real-data test setup (date: 2014)
- 5 km global resolution
- 20,971,520 grid cells
- hybrid run, 4 threads/task
- 1000 steps forecast, without reduced radiation grid, no output
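The quoted cell count and resolution can be checked against each other. The sketch below assumes the usual ICON grid naming RnBk, for which the number of triangular cells is 20 * n^2 * 4^k, and assumes that the test used an R2B9 grid; the slides do not name the grid, so this is an inference from the cell count. The Earth radius of 6371 km is likewise an assumed round value.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def icon_cells(n, k):
    """Number of triangular cells of an RnBk icosahedral grid."""
    return 20 * n * n * 4 ** k


def mean_grid_spacing_km(cells, radius_km=EARTH_RADIUS_KM):
    """Square root of the mean cell area on a sphere of the given radius."""
    sphere_area = 4.0 * math.pi * radius_km ** 2
    return math.sqrt(sphere_area / cells)


cells = icon_cells(2, 9)              # assumed R2B9 grid
spacing = mean_grid_spacing_km(cells)

print(f"cells: {cells:,}")
print(f"mean grid spacing: {spacing:.2f} km")
```

The cell count matches the 20,971,520 cells on the slide, and the mean spacing comes out just under 5 km, consistent with the stated resolution.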

ICON in the High-Q Club
- To join the High-Q Club, an application has to scale across the full JUQUEEN, the 28-rack BlueGene/Q system at the Jülich Supercomputing Centre (JSC) with 458,752 cores (1,835,008 possible threads).
- A cloud-resolving (large-eddy simulation, LES) version of ICON, developed within the project High Definition Clouds and Precipitation for Advancing Climate Prediction, HD(CP)2, has been tested.
- A horizontal grid spacing of approximately 100 m has been used.
- The group was able to show that the LES physics and the dynamical core scale well up to the full JUQUEEN machine.

GPU Developments for the COSMO-Model

The COSMO GPU Project
- Started as part of the larger Swiss HP2C Initiative (High-Performance High-Productivity Computing). Aims of HP2C were (among others):
  - to develop applications running efficiently on different architectures: multicore CPUs (x86) and GPUs
  - to allow domain scientists to easily bring in new developments
  - co-design of the applications
- Porting strategy: all COSMO components have low compute intensity, so it is too costly to transfer data for selected components only. This leads to a full port strategy [1].
- Work is going on in the COSMO Priority Project POMPA (Performance on Massively Parallel Architectures) to implement all changes in the operational code.
[1] Fuhrer, O. et al., Supercomputing Frontiers and Innovations, 1


Porting Strategy
- Avoid CPU-GPU copies by executing all the time loop computations on the GPU
- Complete re-write of the dynamical core with a domain specific language (STELLA) based on C++
- Rest of the model is ported with OpenACC
- Communication library (GCL) for GPU-GPU communications
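Why copies between CPU and GPU are worth avoiding can be illustrated with a back-of-the-envelope estimate. The sketch below compares the time to move one double-precision 3D field to the GPU and back with the time a bandwidth-bound kernel needs to read and write that field once. The grid size is the COSMO-E domain quoted later in this talk; both bandwidth values are hypothetical round numbers chosen for illustration, not measurements of any system mentioned here.

```python
# COSMO-E grid from the performance slides; 8 bytes per double.
NX, NY, NZ = 582, 390, 60
BYTES_PER_VALUE = 8

# Hypothetical round bandwidths in bytes/s (assumptions, not measured).
PCIE_BANDWIDTH = 10e9        # host <-> device link
GPU_MEM_BANDWIDTH = 200e9    # device memory

field_bytes = NX * NY * NZ * BYTES_PER_VALUE

# Copy the field to the GPU and back again.
round_trip_transfer = 2 * field_bytes / PCIE_BANDWIDTH
# A low-intensity kernel that reads the field once and writes it once.
kernel_time = 2 * field_bytes / GPU_MEM_BANDWIDTH

ratio = round_trip_transfer / kernel_time

print(f"field size: {field_bytes / 1e6:.1f} MB")
print(f"transfer (both ways): {round_trip_transfer * 1e3:.2f} ms")
print(f"bandwidth-bound kernel: {kernel_time * 1e3:.2f} ms")
print(f"transfer / kernel: {ratio:.0f}x")
```

With these assumed figures the round trip costs far more than the kernel itself, which is the situation a full port is meant to avoid: data stays on the GPU for the whole time loop.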

Ongoing Work / Applications
- GPU version of the code is used for cloud-resolving climate simulations at European scale (D. Leutwyler, ETHZ)
  - Test case: 7 days in Jan. 2007, includes winter storm Kyrill (18.01.2007); Piz Daint, 144 nodes
- Also used in the COSMO project CALMO for calibration / tuning of the model: 33h simulation on the prototype OPCODE system (8 GPUs)
- GPU version of the code has been run daily for more than 1.5 years in Switzerland for validation
- Merging the changes back into the official COSMO trunk (ongoing)
- GPU version of COSMO will be used for pre-operational forecasting at MeteoSwiss starting November 1

Performance Results
- COSMO-E (2.2 km), MeteoSwiss ensemble configuration: 582x390x60 grid points, 120h, 21 members
- Components: dynamics; physics (microphysics, radiation, sso, turbulence, soil, shallow convection); output
- Comparison (chip to chip):
  - CPU: 8 Intel Xeon E5 2690 (each with 12 cores) on the Cray XC40 Piz Dora at CSCS
  - GPU: 4 NVIDIA Tesla K80 cards (each with 2 GK210 GPUs) on the Cray CS-Storm cluster Piz Kesch at CSCS

Performance Results
- CPU: 8 Intel Xeon E5 2690; GPU: 4 NVIDIA Tesla K80
- Overall run: 2.1x faster on GPU
- Physics, OpenACC, and optimizations: 2.4x faster on GPU
- Dynamics using the STELLA library: 2.4x faster on GPU (also faster on CPU than the original code)
[Figure: time (s), lower is better, for CPU-new DP and GPU-new DP; timing for 1 COSMO-E member (120h, 582x390x60 grid points)]

Details of the Porting

Complete Re-Write of the Dynamics
- STELLA is a domain specific language based on C++ (template meta programming)
- It has been developed by computer scientists in cooperation with MeteoSwiss and C2SM
- Backends are available for:
  - x86 CPUs: C++
  - NVIDIA GPUs: CUDA
  - Xeon Phi: a first implementation exists, but does not yet give satisfactory performance (waiting for KNL)
- It is going to be updated (and will then be called GridTools):
  - for better usability
  - to also include other applications (e.g. global weather and climate models such as ICON)

Complete Re-Write of the Dynamics (II)
Advantages
- You can learn new methods and a new programming style
- Programming with a DSL is architecture independent
- CPU version runs faster than the Fortran version (usage of iterators, blocking)
Disadvantages
- You have to learn new methods and a new programming style
- Low involvement of the actual developers of the dynamical core
- Backends have to be supported by specialists; it is not clear now whether the actual developers can do this
Fortran and C++ dynamical core will co-exist in the next 2-3 years!


Physics
- Code optimized for GPU [2]:
  - loop restructuring (reduce kernel overhead, improve reuse)
  - scalar replacement
  - on-the-fly computations (reduce memory accesses)
  - manual caching (using scalars)
  - replacement of local automatic arrays with global automatic arrays (avoid frequent memory allocation on the GPU)
- Some GPU optimizations degrade performance on the CPU: keep separate routines when required (using ifdef)
[2] Lapillonne and Fuhrer, Parallel Processing Letters, 24

Code Example

Original code

    do k=2,nk
      do i=1,ni
        ! some code 1
        c(i) = D*exp(a(i,k-1))
      end do
      do i=1,ni
        a(i,k) = c(i)*a(i,k)
        ! some code 2
      end do
    end do

OpenACC 1: keeps performance on CPU, not optimal on GPU

    do k=2,nk
      !$acc parallel
      !$acc loop gang vector
      do i=1,ni
        ! some code 1
        c(i) = D*exp(a(i,k-1))
      end do
      !$acc end parallel
      !$acc parallel
      !$acc loop gang vector
      do i=1,ni
        a(i,k) = c(i)*a(i,k)
        ! some code 2
      end do
      !$acc end parallel
    end do

OpenACC 2: optimal performance on GPU, may decrease performance on CPU

    !$acc parallel
    !$acc loop gang vector
    do i=1,ni
      ! level k depends on level k-1, so k runs sequentially within each i
      do k=2,nk
        ! some code 1
        zc = D*exp(a(i,k-1))
        a(i,k) = zc*a(i,k)
        ! some code 2
      end do
    end do
    !$acc end parallel
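That the restructured loop computes the same values as the original can be checked with a small Python sketch. It mirrors the two Fortran variants on plain nested lists: one with two passes over i per level and a temporary array c, and one fused version that replaces c by a scalar zc and handles each column i independently. The function names and the test field are made up for illustration; the placeholders "some code 1" and "some code 2" are omitted.

```python
import math


def split_loops(a, D):
    """Original structure: per level k, two passes over i with a temporary array c."""
    ni, nk = len(a), len(a[0])
    a = [row[:] for row in a]          # work on a copy
    c = [0.0] * ni
    for k in range(1, nk):
        for i in range(ni):
            c[i] = D * math.exp(a[i][k - 1])
        for i in range(ni):
            a[i][k] = c[i] * a[i][k]
    return a


def fused_loops(a, D):
    """Fused structure: one loop nest, scalar zc instead of the array c."""
    ni, nk = len(a), len(a[0])
    a = [row[:] for row in a]          # work on a copy
    for i in range(ni):                # columns are independent of each other
        for k in range(1, nk):         # level k needs the updated level k-1
            zc = D * math.exp(a[i][k - 1])
            a[i][k] = zc * a[i][k]
    return a


# Small made-up field with ni=4 columns and nk=5 levels.
field = [[0.1 * (i + 1) / (k + 1) for k in range(5)] for i in range(4)]

result_split = split_loops(field, 0.5)
result_fused = fused_loops(field, 0.5)
print("identical results:", result_split == result_fused)
```

Both variants apply the same operations to each element in the same order, so the results agree exactly. Only the columns i are independent, which is why the parallelism is taken from the i loop while k stays sequential.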

Speed-up Example in the Radiation
- Considering a key kernel of the radiation
- Test domain 128x128x60, 1 Sandy Bridge CPU vs 1 K20x GPU
- Radiation is the strongest example; on average in the physics, code optimized for the GPU is 1.3x slower when run on the CPU
[Figure: speed-up with respect to the reference code on CPU, higher is better, for reference and GPU-optimized code on CPU and GPU]

Different Optimization Requirements: CPU vs. GPU
CPU: compute bound in the physics
- Compiler auto-vectorization: easier with small loop constructs
- Pre-computation
GPU: memory bandwidth limited
- Benefits from large kernels: reduced kernel launch overhead, better overlap of computation and memory access
- Loop re-ordering and scalar replacement
- On-the-fly computation
Problematic for a single source code:
- Example: cloud ice microphysics in ICON still uses the small kernels (for vectorization), while COSMO uses the version with large kernels. In principle they are the same.
- Global automatic arrays for the CPU version (where available memory might be a problem)
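The effect of kernel size on launch overhead can be made concrete by counting kernel launches for the two OpenACC variants of the code example. The small-kernel variant opens two parallel regions per vertical level, the large-kernel variant opens one region in total. The sketch assumes nk = 60 levels (as in the COSMO-E grid) and a hypothetical launch overhead of 10 microseconds per kernel; that overhead is an illustrative figure, not a measured one.

```python
NK = 60                      # vertical levels, as in the COSMO-E grid
LAUNCH_OVERHEAD_S = 10e-6    # hypothetical cost per kernel launch


def launches_small_kernels(nk):
    """Two parallel regions inside a k loop that runs from 2 to nk."""
    return 2 * (nk - 1)


def launches_large_kernel():
    """A single parallel region around the whole loop nest."""
    return 1


small = launches_small_kernels(NK)
large = launches_large_kernel()

overhead_small = small * LAUNCH_OVERHEAD_S
overhead_large = large * LAUNCH_OVERHEAD_S

print(f"small kernels: {small} launches, {overhead_small * 1e3:.2f} ms overhead")
print(f"large kernel:  {large} launch,  {overhead_large * 1e3:.2f} ms overhead")
```

The launch count scales with the number of levels in the small-kernel variant and is constant in the large-kernel variant, independent of the assumed per-launch cost.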

Developers' Wish List
- We want to use a single source code for all architectures and do not want to use too many ifdefs
- Keep the code simple, portable, readable and similar to Fortran
- Dynamic memory management on GPUs
- Common / unified support for OpenACC by more compilers
  - performance portability for OpenACC
  - what is Intel's opinion on OpenACC?

Thank you very much for your attention
