Accelerating Work at DWD
Ulrich Schättler, Deutscher Wetterdienst
Roadmap

- Porting operational models: revisited
- Preparations for enabling practical work at DWD
- My first steps with COSMO on a GPU
- First experiences with COSMO on KNL
- Implications for further development and maintenance
- Conclusions

14.09.2016, Multi Core 6 Workshop
Porting Operational Models: Revisited

Porting strategy:
- MeteoSwiss has already ported the full COSMO-Model to GPUs.
- At the end of March 2016 they started operational runs with this version (which is based on COSMO-Model 4.19; the current release is 5.03, with several significant changes).
- The process of merging the GPU changes into the official COSMO-Model version has started.
- The future of the STELLA rewrite is not clear yet.
A Significant Change in the COSMO-Model

In the last year we synchronized the physical parameterizations between the new global model ICON and the COSMO-Model, so that both use the same source code. Because ICON stores horizontal fields in a one-dimensional vector, we had to change the data structure used by the COSMO parameterizations. A "copy-in/copy-out" mechanism has been implemented to transform all necessary fields between the parameterizations ((nproma,k) data format) and the rest of the model, which still uses the (i,j,k) structure.
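The copy-in/copy-out idea can be sketched in a few lines. This is a minimal illustration of the blocking scheme only, not COSMO code: the function names and the row-major (i,j) linearization are assumptions; in the model the mapping is held in precomputed index arrays (mind_ilon/mind_jlat) and the block length is the configurable nproma.

```python
def copy_in(field_ijk, ie, je, ke, nproma):
    """Pack an (i,j,k) field into blocked (nproma,k) form: blocks[ib][ip][k]."""
    npoints = ie * je
    nblock = (npoints + nproma - 1) // nproma          # blocks needed to cover all points
    blocks = [[[0.0] * ke for _ in range(nproma)] for _ in range(nblock)]
    for p in range(npoints):
        ib, ip = divmod(p, nproma)                      # block index, position inside block
        i, j = divmod(p, je)                            # recover horizontal indices
        for k in range(ke):
            blocks[ib][ip][k] = field_ijk[i][j][k]
    return blocks, nblock

def copy_out(blocks, ie, je, ke, nproma):
    """Inverse transform: unpack blocked data back into (i,j,k) layout."""
    field = [[[0.0] * ke for _ in range(je)] for _ in range(ie)]
    for p in range(ie * je):
        ib, ip = divmod(p, nproma)
        i, j = divmod(p, je)
        for k in range(ke):
            field[i][j][k] = blocks[ib][ip][k]
    return field
```

The last block is padded up to nproma; a parameterization then always works on fixed-length vectors regardless of the domain shape.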
COSMO-ICON Physics and GPUs

Scheme                    Blocked Version   GPU
Microphysics              yes               no
Radiation                 yes               yes
Subgrid-scale Orography   no                no
Turbulence                yes               no
Surface Schemes           yes               no
Convection                yes               only shallow

Blue: in COSMO and ICON; Black: only in COSMO
Preparations for Practical Work at DWD

To support a model running on GPUs, you should be able to run the model on GPUs yourself. But last year there was no possibility to do so at DWD. However, I do have a GPU in my desktop PC (even an NVIDIA one):

  Device Name:                NVS 315
  Device Revision Number:     2.1
  Global Memory Size:         1068171264
  Number of Multiprocessors:  1
  Number of Cores:            32

"Flexible and energy efficient low profile solution with 1024 MB on board memory, providing display connectivity to drive any type of dual-display."

CPU: Intel Core i7 4790 @ 3.6 GHz
Preparations for Practical Work at DWD (II)

Now only a compiler was missing. Cray is not available for desktop PCs, so I tried a PGI test licence: and that worked! Therefore we bought a server licence this year, which is also available to my colleagues. Duration of this process (from first test to installation of the official compiler): 8 months.
Preparations for Practical Work at DWD (III)

At the end of 2015 the current contract with Cray was extended to the end of 2018. The IvyBridge CPUs are being replaced by Broadwell, and some additional Broadwell nodes are installed; the Haswell partition remains unchanged. This gives an increase of about a factor of 1.6 in computational power. In addition, a development cluster with 12 KNL nodes has been delivered (installation of hardware and software is under way right now). It will be run in flat mode.
My First Steps with the COSMO-Model on a GPU

Task: implement the radiation interface between the ijk and blocked data structures and compute the necessary input for the radiation scheme. The routines of the radiation scheme itself have been ported by Xavier Lapillonne from Switzerland.

Besides porting the loops (see below), you have to get all data movement correct:

  !$acc data create
  !$acc copyin
  !$acc update device / host
  !$acc delete

And after a few trials and errors: it worked!

Temperatures at layer boundaries:

  !$acc parallel
  !$acc loop gang vector collapse(3)
  DO k = 2, ke
    DO jp = 1, nradcoarse
      DO ip = 1, ipdim
        ! get i/j indices for blocked structure
        i = mind_ilon_rad(ip,jp,ib)
        j = mind_jlat_rad(ip,jp,ib)
        zti(ip,k,jp) =                                &
            (  t(i,j,k-1,ntl)*zphfo*(zphf  - zpnf )   &
             + t(i,j,k  ,ntl)*zphf *(zpnf  - zphfo))  &
            * (1.0_wp/(zpnf *(zphf - zphfo)))
      ENDDO
    ENDDO
  ENDDO
  !$acc end parallel
What about the Performance?

Tested 1 hour of forecast for a very small domain (41x39x40 grid points) on one CPU core and on the GPU (times given in seconds):

Scheme                  CPU     GPU
Total Time              15.46   132.26
Radiation                1.82   107.24
Update Device / Host       -      1.59

Conclusion: try to find something different to do, which hopefully has nothing to do with computers. Or at least take some holidays.
Restarted Work after my Holidays

Then I had to face some technical problems:
- The workstation had to be rebooted, after which the GUI no longer worked; I needed help from an administrator (I have no root access). This is due to some "interface" problems between the SUSE Linux distribution and CUDA 7.0.
- Visual profiling (nvvp, pgprof) no longer works.
- The model crashes with a floating point exception in libcuinj64 (?).

Our COSMO support team also reported several problems when installing the Swiss COSMO GPU version on a laptop. The problem is the interplay between the Linux distribution, the required CUDA libraries, gcc versions, etc. But compiling and running the model still worked.
Restarted Work after my Holidays (II)

Tried to recall the problems reported by our Swiss colleagues:
- Allocation of local / automatic arrays on GPUs is not performant and should be avoided. Therefore we implemented the possibility to declare all local arrays ALLOCATABLE and allocate them once at the beginning of the program. This has been done, so it cannot be the performance problem here.
- Side remark: for an OpenMP parallelization these variables have to be declared as "threadprivate". But then the Cray compiler refuses to vectorize loops containing these variables!? Therefore we keep the option to have these as plain local arrays. (Not yet reported to Cray.)
Restarted Work after my Holidays (III)

Tried to recall the problems reported by our Swiss colleagues:
- Vector length: the GPU needs to have enough work.
- The blocked data structure is not implemented with a fixed vector length but is configurable; the default value is nproma=16.
- How do other values influence the performance?
A First Success

Scheme                  CPU     GPU      GPU     GPU     GPU
nproma                          16       32      128     1024
Total Time              15.46   132.26   36.12   21.43   18.24
Radiation                1.82   107.24   18.68    5.72    3.10
Update Device / Host       -      1.59    1.59    1.63    1.62

Tests with a bigger domain showed the same behaviour, but larger nproma then led (on my "low profile" GPU) to:

  Out of memory allocating 7045760 bytes of device memory
  Failing in Thread:1 total/free CUDA memory: 1068171264/6311936
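The nproma trade-off can be reasoned about with simple arithmetic: a larger vector length means more parallel work per kernel launch (better GPU occupancy), but the last block carries more padding and the per-block working set grows. A minimal sketch, illustrative only (the helper name is not from COSMO; the 41x39 horizontal domain is the small test case above):

```python
def block_layout(npoints, nproma):
    """Number of blocks and padded (unused) slots for a given vector length."""
    nblock = (npoints + nproma - 1) // nproma   # blocks needed to cover all points
    padding = nblock * nproma - npoints          # idle slots in the last block
    return nblock, padding

# Horizontal domain of the small test case: 41 x 39 = 1599 grid points.
npoints = 41 * 39
for nproma in (16, 32, 128, 1024):
    nblock, pad = block_layout(npoints, nproma)
    print(f"nproma={nproma:5d}: {nblock:3d} blocks, {pad:3d} padded points")
```

With nproma=16 the scheme is called 100 times per field; with nproma=1024 only twice, at the cost of 449 padded points and a larger per-block memory footprint, which is consistent with the out-of-memory failure on a 1 GB card.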
First Experiences with COSMO on KNLs

Our development cluster is only being built up right now, so we have no experience of our own yet. But colleagues from the Meteorological Institute of the Ludwig-Maximilians-Universität in Munich were able to install the COSMO code on a KNL node they have available. The following slide, provided by Leonhard Scheck and Robert Redl from LMU, shows some early work on KNL.
Benchmark: 3h COSMO run from COSMO RAPS 5.1 (domain size 221 x 219 grid points, 40 levels; fits into the 16 GB MCDRAM). 64-core KNL node (hybrid MCDRAM mode) vs. a node with 2 x 14-core Xeon Haswell. Advantage of KNL: on-chip MCDRAM with 500 GB/sec bandwidth.

Node      Vector instructions   MPI tasks   Wall time [sec]
KNL       AVX2                   32         132.2
KNL       AVX512                 32         131.1
KNL       AVX2                   64         102.6
KNL       AVX512                 64          97.0
KNL       AVX2                  128          87.4
KNL       AVX512                128          80.8
KNL       AVX2                  256         111.8
KNL       AVX512                256         110.1
Haswell   AVX2                   14         105.6
Haswell   AVX2                   28          87.7
Implications for Development and Maintenance

Necessary code modifications for GPU:
- many !$acc directives: but after a while you do not really "see" them any more (they appear as comments)
- memory organization: we could reactivate the old Fortran77 memory manager!
- several ifdefs are necessary (for example to exclude debug printouts)
- try to keep different versions of the same code to a minimum (sometimes necessary due to performance issues)

Code modifications for KNL will most probably also be necessary (at least directives, perhaps an OpenMP parallelization).

We still hope to be able to maintain a single source code for all architectures!

Really necessary now: an automated test suite to check different builds / configurations on different architectures for correctness. This, too, has been developed at MeteoSwiss.
Conclusions

At DWD we now have hardware and software available to test novel architectures. This should accelerate the work of testing our models on GPUs and KNLs and of studying the different programming models. Forecasts are always difficult, but most probably our next computer at DWD (to be purchased in 2018/19) will be neither a pure GPU nor a pure KNL machine.
Thank you very much for your attention