Physical parametrizations and OpenACC directives in COSMO

Size: px

Start display at page:

Download "Physical parametrizations and OpenACC directives in COSMO"

Martha Floyd
5 years ago
Views:

1 Physical parametrizations and OpenACC directives in COSMO Xavier Lapillonne Eidgenössisches Departement des Innern EDI Bundesamt für Meteorologie und Klimatologie MeteoSchweiz Name (change on Master slide) C2SM 1

Full GPU port : multiple strategies All code which uses grid data fields at every time step is ported to GPU Input/Setup keep on CPU / copy to GPU Δt Other (init step ) Physics Dynamics Assimilation

2 Full GPU port : multiple strategies All code which uses grid data fields at every time step is ported to GPU Input/Setup keep on CPU / copy to GPU Δt Other (init step ) Physics Dynamics Assimilation Halo-update Diagnostics Output OpenACC directives OpenACC directives C++ rewrite (uses CUDA) OpenACC directive (part on CPU) GPU-GPU communication library (GCL) OpenACC directives keep on CPU / copy from GPU (not every time step) Finalize 2

3 OpenACC compiler directives OpenACC : Open standard, supported by 3 compiler vendors PGI, Cray, Caps!$acc data copyout(a), copyin(b,c)!$acc parallel!$acc loop gang vector do i=1,n a(i)=b(i)+c(i) end do!$acc end parallel!$acc end data Control data location (GPU/CPU) Tell the compiler to generate GPU code Control over parallel execution Why using directives? GPU-Specific language (Ex: CUDA) Compiler directives Implementation, porting effort Re-write Incremental adaptation of existing code Control over parallel execution Yes Yes Control over memory hierarchy Yes No (automatic) Software management Often separate GPU and CPU code Single source possible Code maintenance/development Requires understanding of GPU computing and new language Retain original language, requires basic concept of GPU computing Target Architecture CUDA: Nvidia, OpenCL: multiple Multiple 3

4 Porting strategy Parallelization: horizontal direction, 1 thread per vertical column Most loop structures unchanged, one only adds directives : port large part of the code For time critical parts (mostly in the physics) additional optimizations: Loop restructuring Removal of local automatic array Scalar replacement Note : some of the GPU-optimized routines are slower on the CPU. For the moment we keep a separate GPU and CPU source for these parts. Directives in COSMO !$acc directives lines of code ported to GPU 1 thank to T. Diamanti, D. Leutwyler, C. Padrin, A. Roches, S. Schaffner. 4

5 Current status of Physics Scheme C-2 C-7 Status Remark microphysics - hydci_pp (ice scheme) - hydci_pp_gr (graupel) x sub grid scale oro. (sso) x x done radiation x x done x done ready (standalone) not ported yet in OPCODE turbulence x x done only operational options soil model - terra_multlay - terra1 - terra2 - seaice - flake_interface convection - conv_tiedtke - organize_conv_kainfri - conv_shallow x x x done Boundary exchange x x done x work in progress Ready done only operational options Needs an additional terms from dynamics 5

6 Performance results using directives : Physics Test domain 263x92x60, 1h simulation. Socket to Socket comparison : 8 core Intel Sandy Bridge CPU vs K20 GPU K20 GPU vs 1 Sandy Bridge CPU Speed up Cray PGI Physics Microphysics Radiation Turbulence Soil 90% of physics run time on original CPU code: microphysics, radiation and turbulence -> Higher optimization effort Overall speed up factor between 3x and 4x (depend also on input data) 6

7 Summary / Conclusions Large part of COSMO were successfully ported using OpenACC For the physics, additional optimizations were possible and we obtain good performance on GPU : 3x speed up on a node to node comparison An important part of the project (OPCODE) was relying on the PGI compiler, we should have tried to involve the PGI developer earlier A lot of time (most in fact) was spend trying to fix issues with compilers (new technology) There still are substantial differences between compilers (work in progress) The physics is now using different source files for CPU and GPU. We still need to determine how much could be shared 7

8 What s next Include missing physics in COSMO (deep convection, graupel microphysics, missing term from the dynamic) Further optimization / cleaning of the physics Investigate how much code can be shared between CPU and GPU How can the modified physics packages go back in the COSMO official trunk More directives in diagnostics / outputs Data assimilation (see next presentation) 8

9 Thank you Name (change on Master slide) C2SM 9

10 Performance results using directives : Physics Test domain 263x92x60, 1h simulation. Socket to Socket comparison : 8 core Intel Sandy Bridge CPU vs K20 GPU Physics Microphysics Radiation Turbulence Soil Sandy bridge Sandy bridge lblock_phys=t nproma= K20x (CCE) K20 (PGI)

Adapting Numerical Weather Prediction codes to heterogeneous architectures: porting the COSMO model to GPUs

Adapting Numerical Weather Prediction codes to heterogeneous architectures: porting the COSMO model to GPUs O. Fuhrer, T. Gysi, X. Lapillonne, C. Osuna, T. Dimanti, T. Schultess and the HP2C team Eidgenössisches