Accelerating Work at DWD
Ulrich Schättler, Deutscher Wetterdienst
Multi Core 6 Workshop, 14.09.2016

Roadmap
- Porting operational models: revisited
- Preparations for enabling practical work at DWD
- My first steps with the COSMO-Model on a GPU
- First experiences with COSMO on KNL
- Implications on further development and maintenance
- Conclusions

Porting Operational Models: Revisited
Porting strategy:
- MeteoSwiss has already ported the full COSMO-Model to GPUs.
- At the end of March 2016 they started operational runs with this version (which is based on COSMO-Model 4.19; we are now at 5.03 with several significant changes).
- The process has started to implement the GPU changes in the official COSMO-Model version.
- The future of the STELLA re-write is not clear yet.

A Significant Change in the COSMO-Model
- In the last year we synchronized the physical parameterizations between the new global model ICON and the COSMO-Model so that they use the same source code.
- Because ICON stores horizontal fields only as a one-dimensional vector, we had to change the data structure in COSMO for the parameterizations: from the (i,j,k) data format to the blocked (nproma,k) data format.
- A "copy-in/copy-out" mechanism has been implemented to transform all necessary fields between the parameterizations and the rest of the model (which still uses the ijk structure).
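To make the copy-in/copy-out idea concrete, here is a minimal sketch in Fortran. The module, routine, and index-array names (block_copy_sketch, copy_to_block, mind_ilon, mind_jlat) are hypothetical and only stand in for the actual COSMO code; the copy-out step is the same loop with the assignment reversed.

! A minimal sketch of the copy-in step, with hypothetical names: gather one
! (i,j,k) field into the blocked (nproma,k) layout before a parameterization
! is called; copy-out is the same loop with left and right sides swapped.
MODULE block_copy_sketch
  IMPLICIT NONE
  INTEGER, PARAMETER :: wp = SELECTED_REAL_KIND(12)   ! working precision, as COSMO's wp
CONTAINS
  SUBROUTINE copy_to_block (t_ijk, t_b, mind_ilon, mind_jlat, ib, nproma, ke)
    INTEGER,  INTENT(IN)  :: ib, nproma, ke                  ! block index, block size, levels
    INTEGER,  INTENT(IN)  :: mind_ilon(:,:), mind_jlat(:,:)  ! (nproma, nblock) index lists
    REAL(wp), INTENT(IN)  :: t_ijk(:,:,:)                    ! field in (i,j,k) layout
    REAL(wp), INTENT(OUT) :: t_b(nproma,ke)                  ! field in blocked (nproma,k) layout
    INTEGER :: ip, k

    DO k = 1, ke
      DO ip = 1, nproma
        t_b(ip,k) = t_ijk(mind_ilon(ip,ib), mind_jlat(ip,ib), k)
      ENDDO
    ENDDO
  END SUBROUTINE copy_to_block
END MODULE block_copy_sketch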

COSMO-ICON Physics and GPUs

Scheme                    Blocked Version   GPU
Microphysics              yes               no
Radiation                 yes               yes
Subgrid-scale Orography   no                no
Turbulence                yes               no
Surface Schemes           yes               no
Convection                yes               only shallow

(Slide legend: blue = in COSMO and ICON, black = only in COSMO.)

Preparations for Practical Work at DWD
- To support a model running on GPUs you should be able to let the model run on GPUs. But last year there was no possibility to do so at DWD.
- But I have a GPU in my desktop PC (even from NVIDIA):
    Device Name:               NVS 315
    Device Revision Number:    2.1
    Global Memory Size:        1068171264
    Number of Multiprocessors: 1
    Number of Cores:           32
  "Flexible and Energy efficient low profile solution with 1024 MB on board memory, providing display connectivity to drive any type of dual-display."
- CPU: Intel Core i7 4790 @ 3.6 GHz

Preparations for Practical Work at DWD (II)
- Now the only thing missing was a compiler: Cray is not available for desktop PCs, so I tried a PGI test licence: and that worked!
- Therefore we bought a server licence this year, which is also available to my colleagues.
- Duration of this process (from first test to installation of the official compiler): 8 months.

Preparations for Practical Work at DWD (III)
- At the end of 2015 the current contract with Cray was extended to the end of 2018.
- The IvyBridge CPUs are being replaced by Broadwell, and some additional Broadwell nodes are installed. The Haswell partition remains unchanged. This will increase the computational power by a factor of about 1.6.
- In addition, a development cluster with 12 KNL nodes has been delivered (installation of hardware and software is under way right now). It will be run in flat mode.

My First Steps with the COSMO-Model on a GPU
- Task: implement the radiation interface between the ijk and the blocked data structure and compute the necessary input for the radiation scheme.
- The routines of the radiation scheme itself had been ported by Xavier Lapillonne from Switzerland.
- Besides porting the loops (see the example below), you have to get all the data management correct:
    !$acc data create
    !$acc copyin
    !$acc update device / host
    !$acc delete
  (A minimal sketch of how these directives combine is shown after this slide.)
- And after a few trials and errors: it worked!

Example: temperatures at layer boundaries

!$acc parallel
!$acc loop gang vector collapse(3)
DO k = 2, ke
  DO jp = 1, nradcoarse
    DO ip = 1, ipdim
      ! get i/j indices for blocked structure
      i = mind_ilon_rad(ip,jp,ib)
      j = mind_jlat_rad(ip,jp,ib)
      zti(ip,k,jp) =                              &
          ( t(i,j,k-1,ntl)*zphfo*(zphf - zpnf )   &
          + t(i,j,k  ,ntl)*zphf *(zpnf - zphfo) ) &
          * (1.0_wp / (zpnf*(zphf - zphfo)))
    ENDDO
  ENDDO
ENDDO
!$acc end parallel
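The following is a minimal, hypothetical sketch (not the actual COSMO interface code) of how the data directives listed above are typically combined around such kernels: device copies are created once, kept resident across the compute loops, and results are copied back only where the CPU needs them.

! Hypothetical sketch of a data region around ported radiation kernels.
SUBROUTINE radiation_data_region_sketch (t, zti)
  IMPLICIT NONE
  REAL, INTENT(IN)    :: t(:,:,:,:)     ! prognostic field in (i,j,k,ntl) layout
  REAL, INTENT(INOUT) :: zti(:,:,:)     ! blocked field computed on the device

  !$acc data copyin(t) create(zti)      ! allocate / copy to the device once

  ! ... ported compute loops, e.g. the temperature interpolation above ...

  !$acc update host(zti)                ! bring the result back to the CPU
  !$acc end data                        ! device copies of t and zti are released here
END SUBROUTINE radiation_data_region_sketch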

What about the Performance?
Tested 1 hour of forecast for a very small domain (41 x 39 x 40 grid points) on one CPU core and on the GPU (times given in seconds):

Scheme                  CPU      GPU
Total Time              15.46    132.26
Radiation                1.82    107.24
Update Device / Host       -       1.59

Conclusion: Try to look for something different to do, which hopefully has nothing to do with computers. Or have some holidays at least.

Restarted Work after my Holidays
- Had to face some technical problems then:
  - The workstation had to be rebooted; afterwards the GUI did not work any more: I needed help from an administrator (I have no root access). This is due to some "interface" problems between the SUSE Linux distribution and CUDA 7.0.
  - Visual profiling (nvvp, pgprof) is not working any more.
  - The model crashes with a floating point exception in libcuinj64?????
- Our COSMO support team also reported several problems when installing the Swiss COSMO-GPU version on a laptop. The problem is the interplay between the Linux distribution, the required CUDA libraries, gcc versions, etc.
- But compilation and running the model still worked.

Restarted Work after my Holidays (II)
Tried to recall the problems reported by our Swiss colleagues:
- Allocation of local / automatic arrays on GPUs: this is not performant and should be avoided. Therefore we implemented the possibility to declare all local arrays as ALLOCATABLE and allocate them at the beginning of the program. This has been done, so it cannot be the performance problem here. (A sketch of this variant follows below.)
- Side remark: for the OpenMP parallelization these variables have to be declared as "threadprivate". But then the Cray compiler refuses to vectorize loops with these variables!? Therefore we keep the option of having them as plain local arrays. (Have not reported that to Cray up to now.)
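A minimal sketch of the ALLOCATABLE variant described above, with hypothetical names: the work array is allocated (and created on the device) once at program start, instead of being an automatic array that the GPU would have to allocate in every call.

! Hypothetical sketch: work array allocated once instead of per call.
MODULE work_array_sketch
  IMPLICIT NONE
  REAL, ALLOCATABLE :: zwork(:,:)      ! replaces an automatic array in the scheme
  !$omp threadprivate (zwork)          ! required for OpenMP; this is the declaration
                                       ! the Cray compiler refused to vectorize with
CONTAINS
  SUBROUTINE init_work_arrays (nproma, ke)
    ! called once at program start (for OpenMP inside a parallel region,
    ! so that every thread allocates its private copy)
    INTEGER, INTENT(IN) :: nproma, ke
    IF (.NOT. ALLOCATED(zwork)) ALLOCATE (zwork(nproma,ke))
    !$acc enter data create(zwork)     ! create the device copy once as well
  END SUBROUTINE init_work_arrays
END MODULE work_array_sketch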

Restarted Work after my Holidays (III)
Tried to recall the problems reported by our Swiss colleagues:
- Vector length: the GPU needs to have enough work.
  - The blocked data structure is not implemented with a fixed vector length but is configurable (see the sketch below).
  - The default value is nproma=16.
  - How do other values influence the performance?
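A minimal sketch, with hypothetical names, of how nproma controls the amount of work per GPU kernel launch: the nij horizontal grid points are processed in blocks of at most nproma points, so a larger nproma means fewer but larger kernels and better GPU utilisation.

! Hypothetical sketch of the block loop driven by the configurable nproma.
SUBROUTINE loop_over_blocks (nij, nproma)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: nij, nproma
  INTEGER :: nblock, ib, ipend

  nblock = (nij + nproma - 1) / nproma        ! number of blocks
  DO ib = 1, nblock
    ipend = MIN(nproma, nij - (ib-1)*nproma)  ! last block may be shorter
    ! CALL radiation_in_block (ib, ipend, ...)   ! each call works on ipend points
  ENDDO
END SUBROUTINE loop_over_blocks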

A First Success

Scheme                  CPU      GPU      GPU     GPU     GPU
nproma                  16       16       32      128     1024
Total Time              15.46    132.26   36.12   21.43   18.24
Radiation                1.82    107.24   18.68    5.72    3.10
Update Device / Host       -       1.59    1.59    1.63    1.62

(times in seconds)

Tests with a bigger domain showed the same behaviour, but a bigger nproma then leads (on my "low profile" GPU) to:

Out of memory allocating 7045760 bytes of device memory
Failing in Thread:1 total/free CUDA memory: 1068171264/6311936

First Experiences with COSMO on KNLs
- Our development cluster is only being built up right now, so we have no experiences of our own yet.
- But colleagues from the Meteorological Institute of the Ludwig-Maximilians-University in Munich could install the COSMO code on a KNL node they have available.
- The following slide was provided by Leonhard Scheck and Robert Redl from LMU and shows some early work on KNL.

Benchmark: 3h COSMO run from COSMO RAPS 5.1 (domain size 221 x 219 grid points, 40 levels; fits into the 16 GB MCDRAM)
64-core KNL node (hybrid MCDRAM mode) vs. 2 x 14-core Xeon Haswell node
Advantage of KNL: on-chip MCDRAM with 500 GB/sec bandwidth

Node      Vector instructions   MPI tasks   Wall time [sec]
KNL       AVX2                   32         132.2
KNL       AVX512                 32         131.1
KNL       AVX2                   64         102.6
KNL       AVX512                 64          97.0
KNL       AVX2                  128          87.4
KNL       AVX512                128          80.8
KNL       AVX2                  256         111.8
KNL       AVX512                256         110.1
Haswell   AVX2                   14         105.6
Haswell   AVX2                   28          87.7

Implications on Development and Maintenance
- Necessary code modifications for GPU:
  - include many !$acc directives: but after a while you do not really "see" them any more (they appear as comments)
  - memory organization: we could activate the old Fortran77 memory manager!
  - several ifdefs are necessary (for example to exclude debug print-outs)
  - try to keep different versions of the same code to a minimum (sometimes necessary due to performance issues)
- Code modifications for KNL are most probably also necessary (at least directives, perhaps an OpenMP parallelization).
- We still hope to be able to maintain a single source code for all architectures! (A minimal sketch of this idea follows below.)
- Really necessary now: an automated test suite to check different builds / configurations on different architectures for correctness. This has also been developed at MeteoSwiss.
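A minimal, hypothetical sketch of what the single-source idea can look like: the OpenACC directives are plain comments for compilers that ignore them, and debug print-outs (which would force device-to-host copies) are fenced off with the preprocessor; _OPENACC is predefined by OpenACC-aware compilers.

! Hypothetical single-source routine: same code for CPU and GPU builds.
SUBROUTINE single_source_sketch (ie, je, ke, field)
  IMPLICIT NONE
  INTEGER, INTENT(IN)    :: ie, je, ke
  REAL,    INTENT(INOUT) :: field(ie,je,ke)
  INTEGER :: i, j, k

  !$acc parallel loop gang vector collapse(3) copy(field)
  DO k = 1, ke
    DO j = 1, je
      DO i = 1, ie
        field(i,j,k) = MAX(field(i,j,k), 0.0)   ! some stand-in computation
      ENDDO
    ENDDO
  ENDDO

#ifndef _OPENACC
  ! debug output only in the CPU build; on the GPU it would need an update host
  PRINT *, 'field min/max: ', MINVAL(field), MAXVAL(field)
#endif
END SUBROUTINE single_source_sketch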

Conclusions
- At DWD we now have hardware and software available to test novel architectures.
- This should accelerate the work to test our models on GPUs and KNLs and to study the different programming models.
- Forecasts are always difficult, but most probably our next computer at DWD (to be purchased in 2018/19) will not be a pure GPU or a pure KNL machine.

Thank you very much for your attention