Deutscher Wetterdienst

1 Accelerating Work at DWD
Ulrich Schättler, Deutscher Wetterdienst

2 Roadmap
- Porting operational models: revisited
- Preparations for enabling practical work at DWD
- My first steps with the COSMO-Model on a GPU
- First experiences with COSMO on KNL
- Implications for further development and maintenance
- Conclusions
Multi Core 6 Workshop

3 Porting Operational Models: Revisited
Porting Strategy
- MeteoSwiss has already ported the full COSMO-Model to GPUs.
- At the end of March 2016 they started operational runs with this version (which is based on COSMO-Model 4.19; we are now at 5.03, with several significant changes).
- The process of implementing the GPU changes in the official COSMO-Model version has started.
- The future of the STELLA rewrite is not clear yet.

4 A Significant Change in the COSMO-Model
- In the last year we synchronized the physical parameterizations between the new global model ICON and the COSMO-Model so that both use the same source code.
- Because ICON uses only a one-dimensional vector to store horizontal fields, we had to change the data structure in COSMO for the parameterizations.
- A "copy-in/copy-out" mechanism has been implemented to transform all necessary fields between the parameterizations and the rest of the model (which still uses the ijk structure).
Figure: (i,j,k) data format and (nproma,k) data format
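The copy-in/copy-out idea can be illustrated with a small sketch. The following Python code is not COSMO source code: the function names, the row-by-row numbering of horizontal points, and the handling of a partly filled last block are all assumptions made for illustration. It gathers one block of at most nproma horizontal points from an (i,j,k) field into an (nproma,k) work array, lets a stand-in "parameterization" modify it, and scatters the result back.

```python
def block_indices(ie, je, nproma, ib):
    """Return the (i, j) pairs belonging to block ib (0-based).

    Assumption: horizontal points are numbered row by row; the last
    block may hold fewer than nproma points.
    """
    start = ib * nproma
    stop = min(start + nproma, ie * je)
    return [(n % ie, n // ie) for n in range(start, stop)]


def copy_in(field_ijk, ke, nproma, idx):
    """Gather the columns listed in idx into an (nproma, k) work array."""
    block = [[0.0] * ke for _ in range(nproma)]
    for ip, (i, j) in enumerate(idx):
        for k in range(ke):
            block[ip][k] = field_ijk[i][j][k]
    return block


def copy_out(block, field_ijk, ke, idx):
    """Scatter the work array back into the (i, j, k) field."""
    for ip, (i, j) in enumerate(idx):
        for k in range(ke):
            field_ijk[i][j][k] = block[ip][k]


# Tiny demonstration: 3 x 2 horizontal points, 2 levels, nproma = 4.
ie, je, ke, nproma = 3, 2, 2, 4
field = [[[100.0 * i + 10.0 * j + k for k in range(ke)]
          for j in range(je)] for i in range(ie)]
nblocks = (ie * je + nproma - 1) // nproma  # ceiling division

for ib in range(nblocks):
    idx = block_indices(ie, je, nproma, ib)
    work = copy_in(field, ke, nproma, idx)
    # Stand-in for a parameterization working on the blocked data.
    for ip in range(len(idx)):
        for k in range(ke):
            work[ip][k] += 1.0
    copy_out(work, field, ke, idx)
```

In this sketch the code outside the block loop only ever sees the (i,j,k) layout, while the stand-in parameterization only sees the (nproma,k) layout.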

5 COSMO-ICON Physics and GPUs
Scheme                    Blocked Version   GPU
Microphysics              yes               no
Radiation                 yes               yes
Subgrid-scale Orography   no                no
Turbulence                yes               no
Surface Schemes           yes               no
Convection                yes               only shallow
Legend: Blue: in COSMO and ICON. Black: only in COSMO.

6 Preparations for Practical Work at DWD
To support a model running on GPUs, you should be able to run the model on GPUs. But last year there was no possibility to do so at DWD.
But I have a GPU in my desktop PC (even one from NVIDIA):
- Device Name: NVS 315
- Device Revision Number: 2.1
- Global Memory Size:
- Number of Multiprocessors: 1
- Number of Cores: 32
"Flexible and Energy efficient low profile solution with 1024 MB on board memory, providing display connectivity to drive any type of dual-display".
CPU: Intel Core i GHz

7 Preparations for Practical Work at DWD (II)
- Now I only lacked a compiler: Cray is not available for desktop PCs, so I tried a PGI test licence, and that worked!
- Therefore we bought a server licence this year, which is also available to my colleagues.
- Duration of this process (from first test to installation of the official compiler): 8 months

8 Preparations for Practical Work at DWD (III)
- At the end of 2015 the current contract with Cray was extended to the end of 2018.
- The IvyBridge CPUs are being replaced by Broadwell, and some additional Broadwells are being installed. The Haswell partition remains unchanged.
- This will increase the computational power by a factor of about 1.6.
- In addition, a development cluster with 12 KNL nodes is being delivered (installation, including the software, is under way right now).
- It will be run in flat mode.

9 My First Steps with the COSMO-Model on a GPU
Task: Implement the radiation interface between the ijk and the blocked data structure, and compute the necessary input for the radiation scheme.
- The routines of the radiation scheme have been ported by Xavier Lapillonne from Switzerland.
- Besides porting the loops (see the example below), you have to get all of these directives correct:
    !$acc data create
    !$acc copyin
    !$acc update device / host
    !$acc delete
- And after some trial and error: It worked!
Example loop:
    ! Temperatures at layer boundaries
    !$acc parallel
    !$acc loop gang vector collapse(3)
    DO k = 2, ke
      DO jp = 1, nradcoarse
        DO ip = 1, ipdim
          ! get i/j indices for blocked structure
          i = mind_ilon_rad(ip,jp,ib)
          j = mind_jlat_rad(ip,jp,ib)
          zti(ip,k,jp) =                                   &
              (t(i,j,k-1,ntl)*zphfo*(zphf - zpnf )         &
             + t(i,j,k  ,ntl)*zphf *(zpnf - zphfo))        &
             * (1.0_wp/(zpnf *(zphf - zphfo)))
        ENDDO
      ENDDO
    ENDDO
    !$acc end parallel
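Read term by term, the expression assigned to zti linearly interpolates the product of temperature and pressure in pressure and then divides by the pressure at the layer boundary. The Python sketch below reproduces only that arithmetic for a single point. The argument names are hypothetical, and reading zphfo and zphf as the pressures at the full levels k-1 and k, and zpnf as the pressure at the boundary between them, is an assumption, since the slide does not define these variables.

```python
def boundary_temperature(t_upper, t_lower, p_upper, p_lower, p_boundary):
    """Temperature at the boundary between two model layers.

    Mirrors the expression in the loop, with this assumed mapping:
      t_upper ~ t(i,j,k-1,ntl), t_lower ~ t(i,j,k,ntl),
      p_upper ~ zphfo, p_lower ~ zphf, p_boundary ~ zpnf
    """
    numerator = (t_upper * p_upper * (p_lower - p_boundary)
                 + t_lower * p_lower * (p_boundary - p_upper))
    return numerator / (p_boundary * (p_lower - p_upper))
```

If p_boundary equals p_upper the function returns t_upper, and if it equals p_lower it returns t_lower. Checks of this kind are a cheap way to validate a ported version of the loop against the CPU result.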

10 What about the Performance?
Tested 1 hour of forecast for a very small domain (41x39x40 grid points) on one CPU core and on the GPU (times given in seconds):
Scheme                 CPU   GPU
Total Time
Radiation
Update Device / Host
Conclusion: Try to look for something different to do, which hopefully has nothing to do with computers. Or at least take a holiday.

11 Restarted Work after my Holidays
Had to face some technical problems then:
- The workstation had to be rebooted, and then the GUI did not work any more: I needed help from an administrator (I have no root access). This is due to some "interface" problems between the SUSE Linux distribution and CUDA 7.0.
- Visual profiling (nvvp, pgprof) is not working any more. The model crashes with a floating point exception in libcuinj64 (?).
- Our COSMO Support Team also reported several problems when installing the Swiss COSMO-GPU version on a laptop. The problem is the interplay between the Linux distribution, the required CUDA libraries, gcc versions, etc.
- But compiling and running the model still worked.

12 Restarted Work after my Holidays (II)
Tried to recall the problems reported by our Swiss colleagues:
Allocation of local / automatic arrays on GPUs:
- This performs poorly and should be avoided on GPUs.
- Therefore we implemented the possibility to declare all local arrays as ALLOCATABLE and to allocate them once at the beginning of the program.
- This has been done, so it could not be the cause of the performance problem.
- Side remark: For OpenMP parallelization these variables have to be declared as "threadprivate". But then the Cray compiler refuses to vectorize loops with these variables!? Therefore we keep the option to have them as local arrays. (We have not reported this to Cray so far.)

13 Restarted Work after my Holidays (III)
Tried to recall the problems reported by our Swiss colleagues:
Vector length:
- The GPU needs to have enough work.
- The blocked data structure is implemented not with a fixed vector length but with a configurable one.
- The default value is nproma=16.
- How do other values influence the performance?
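To get a feeling for what nproma means for the amount of work per block, the sketch below counts blocks for a given number of horizontal points. It is a back-of-the-envelope helper, not COSMO code. The function name is hypothetical, it assumes that all 41 x 39 horizontal points of the small test domain are distributed over blocks of at most nproma points, and the nproma values other than the default of 16 are arbitrary examples.

```python
def block_layout(npoints, nproma):
    """Return (number of blocks, points in the last block)."""
    nblocks = (npoints + nproma - 1) // nproma  # ceiling division
    last = npoints - (nblocks - 1) * nproma
    return nblocks, last


npoints = 41 * 39  # horizontal points of the small test domain (assumption)
for nproma in (16, 64, 256, 1024):
    nblocks, last = block_layout(npoints, nproma)
    print(f"nproma={nproma:5d}: {nblocks:4d} blocks, last block holds {last} points")
```

A larger nproma gives fewer and larger blocks, so each block offers more parallel work, but the (nproma,k) work arrays grow accordingly.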

14 A First Success
Scheme                 CPU   GPU   GPU   GPU   GPU
nproma
Total Time
Radiation
Update Device / Host
Tests with a bigger domain showed the same behaviour, but a bigger nproma then led (on my "low profile" GPU) to:
    Out of memory allocating bytes of device memory
    Failing in Thread:1
    total/free CUDA memory: /

15 First Experiences with COSMO on KNLs
- Our development cluster is only being set up right now, so we have no experience of our own yet.
- But colleagues from the Meteorological Institute of the Ludwig-Maximilians-University (LMU) in Munich were able to install the COSMO code on a KNL node they have available.
- The following slide was provided by Leonhard Scheck and Robert Redl from LMU and shows some early work on KNL.

16 Benchmark: 3h COSMO run from COSMO RAPS 5.1
- Domain size 221 x 219 grid points, 40 levels (fits into 16 GB MCDRAM)
- 64-core KNL node (hybrid MCDRAM mode) vs. 2 x 14-core Xeon Haswell node
- Advantage of KNL: on-chip MCDRAM with 500 GB/s bandwidth
node      vector instructions   MPI tasks   Wall time [sec]
KNL       AVX
KNL       AVX
KNL       AVX
KNL       AVX
KNL       AVX
KNL       AVX
KNL       AVX
KNL       AVX
Haswell   AVX
Haswell   AVX

17 Implications on Development and Maintenance
Necessary code modifications for GPU:
- Include many !$acc directives: but after a while you do not really "see" them any more (they appear as comments).
- Memory organization: could activate the old Fortran77 memory manager!
- Several ifdefs are necessary (for example to exclude debug printouts).
- Try to keep different versions of the same code to a minimum (they are necessary due to performance issues).
Code modifications for KNL are most probably also necessary (at least directives; perhaps OpenMP parallelization).
We still hope to be able to maintain a single source code for all architectures!
Really necessary now: an automated test suite to check different builds / configurations on different architectures for correctness. This, too, has been developed at MeteoSwiss.
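As a rough illustration of what such a correctness check involves, the sketch below compares a field from a reference build with the same field from another build, using a relative tolerance. This is not the MeteoSwiss test suite: the function names, the tolerance value, and the sample numbers are all made up for illustration.

```python
def max_relative_difference(reference, candidate, floor=1e-30):
    """Largest relative difference between two equally long sequences.

    floor avoids division by zero where the reference value is 0.
    """
    if len(reference) != len(candidate):
        raise ValueError("fields differ in size")
    return max(
        (abs(r - c) / max(abs(r), floor) for r, c in zip(reference, candidate)),
        default=0.0,
    )


def fields_agree(reference, candidate, rtol=1e-10):
    """True if the candidate matches the reference within rtol."""
    return max_relative_difference(reference, candidate) <= rtol


# Made-up sample values standing in for model output.
cpu_field = [273.15, 280.0, 265.5]
gpu_field = [273.15, 280.0 * (1.0 + 1e-13), 265.5]
print(fields_agree(cpu_field, gpu_field))                # rounding-level difference
print(fields_agree(cpu_field, [273.15, 281.0, 265.5]))   # clearly different value
```

A tolerance is needed because different compilers and architectures may reorder floating-point operations, so bit-identical results across builds cannot be expected in general.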

18 Conclusions
- At DWD we now have hardware and software available to test novel architectures.
- This should accelerate the work of testing our models on GPUs and KNLs and of studying the different programming models.
- Forecasts are always difficult, but most probably our next computer at DWD (to be purchased in 2018/19) will be neither a pure GPU nor a pure KNL machine.

19 Thank you very much for your attention
