Improving the Energy- and Time-to-solution of COSMO-ART
1 Joseph Charles, William Sawyer (ETH Zurich - CSCS), Heike Vogel (KIT), Bernhard Vogel (KIT), Teresa Beck (KIT/UHEI)
COSMO User Workshop, MeteoSwiss, January 18, 2016
2 Summary
3 Main Objectives
Utilise project methodologies to attain a 5x energy-to-solution (ETS) improvement for COSMO-ART:
- Code optimisations / refactoring on CPUs
- System software (other compilers, optimised libraries)
- New algorithms
- New architectures (GPUs, emerging CPUs, ARM)
Technical challenges with a code under constant development:
- Run configuration must be recreated in all subsequent versions
- Results must be reproducible within an expected variance
- Target application: COSMO-HAM (ETH Zurich) or COSMO-ART (KIT, EMPA)?
- Redefinition of the baseline to reflect oversights and newer versions of ART
- Management of different branches, validation, incorporation of versions, e.g. COSMO-4.28, COSMO-4.30, COSMO-5.0, COSMO-5.1_beta, OPCODE COSMO-5.1_beta
- Incongruities / incompatibilities between versions, e.g. OPCODE COSMO was based on COSMO-5.0 and was not upgraded to 5.1 by the end of the project
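The reproducibility requirement ("within an expected variance") is usually enforced by a tolerance-based field comparison rather than bitwise identity, since compiler, precision, and algorithm changes all perturb the bits. A minimal C++ sketch of such a check; the relative-L2 tolerance of 1e-5 is purely illustrative and not a value from the project:

```cpp
// Sketch: tolerance-based validation of a refactored run against a reference.
// Bitwise identity is impossible across compilers/precisions, so fields are
// compared in a relative L2 norm; the 1e-5 threshold is illustrative only.
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

bool within_variance(const std::vector<double>& ref,
                     const std::vector<double>& test, double tol) {
    double diff2 = 0.0, ref2 = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        diff2 += (test[i] - ref[i]) * (test[i] - ref[i]);
        ref2  += ref[i] * ref[i];
    }
    return std::sqrt(diff2) <= tol * std::sqrt(ref2);
}

int main() {
    std::vector<double> ref = {1.0, 2.0, 3.0}, test = {1.0, 2.000001, 3.0};
    std::printf("validated: %s\n", within_variance(ref, test, 1e-5) ? "yes" : "no");
}
```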
4 Main Results
WP5 Roadmap (Mar. 2014):
- Energy profiling of the COSMO-ART baseline (ETH Zurich - CSCS / UHAM / UJI)
- Optimal setup for discretisation parameters, compilers (ETH Zurich - CSCS)
- Refactoring for CPUs (ETH Zurich - CSCS / IBM Research - Zurich)
- ODE solver algorithmic changes (KIT / ETH Zurich - CSCS)
- Mixed-precision COSMO-ART (ETH Zurich - CSCS / KIT)
- Port of COSMO-ART components to accelerators (ETH Zurich - CSCS)
- Feasibility study of a reduced model for gas-phase chemistry (KIT)
- Investigation of possibilities of ART on ARM (UHEI)
Milestones:
- MS10 (M30): Refactored COSMO-ART code prototype for CPUs and multi-core architectures
- MS11 (M36): Performance model for ARM and other emerging hardware
Deliverables:
- D5.1 (M24): Benchmarking report on energy requirements of the current COSMO-ART
- D5.2 (M30): Refactored COSMO-ART code prototype for CPUs and multi-core architectures
- D5.3 (M36): Final delivery of software prototypes, documentation, and summary report
5 Exploitable results
COSMO-ART version optimised with respect to energy-to-solution
- Intellectual Property Rights (IPR): OPCODE COSMO: open source with proprietary background IP. ART: open source with proprietary background IP, available for scientific use after signing an agreement
- Usage scenario: run COSMO-ART in a more cost-effective and energy-efficient manner on applicable hardware platforms
- Sector of application: atmospheric chemistry research
One-moment graupel microphysics standalone C++ code using STELLA
- IPR: open source with proprietary components from the COSMO Consortium
- Usage scenario: assess the potential performance improvement of a COSMO component on multi-core CPU and GPU architectures from a single source code utilising the STELLA framework
- Sector of application: computational science
Box model test framework for the Kinetics PreProcessor (KPP)
- IPR: open source with proprietary background IP; an additional licence for KPPA is needed
- Usage scenario: comparison of an existing KPP implementation in a given application with the same solvers generated by the KPPA proprietary software
- Sector of application: computational chemistry
6 Results Overview
7 COSMO-ART: Atmospheric Chemistry as Showcase
[Figure: runtime breakdown for the reference baseline (GNU compiler, 240 PEs), COSMO (TTS = … s) vs. COSMO-ART (TTS = 4,… s), split into Dynamics, Physics, MPI Comm./Sync. (Dyn. and Phy.), Input, Output, Other, and (for COSMO-ART) ART]
- COSMO: a ubiquitous weather forecast model in Europe, in widespread use at the national weather services of Germany, Switzerland, Italy, Greece, Poland, Romania and Russia, and at a large number of agencies including military and research institutions
- COSMO-ART: COSMO extended for Aerosols and Reactive Trace gases, e.g. for air quality prediction; massive increase in computational expense due to the atmospheric chemistry and the additional tracers to advect (only relatively short simulation times are currently viable)
8 Strategy Overview
- Aerosol Reactive Transport (ART) for atmospheric chemistry
- Optimisations of the time-stepping in solvers generated by the Kinetics PreProcessor (KPP)
- Proprietary KPP version (KPPA) generating multithreaded CPU and CUDA (GPU) code
- CPU/GPU-optimised version of the COSMO NWP model
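Why KPP-generated solvers parallelise so well (cf. the POWER8 results later in the talk): the gas-phase chemistry is decoupled in space, so each grid cell carries its own small stiff ODE system that can be integrated independently of its neighbours. A minimal C++/OpenMP sketch of this pattern, with a made-up two-species mechanism and a plain implicit-Euler step standing in for the Rosenbrock solvers that KPP actually generates (compile with -fopenmp):

```cpp
// Sketch only: per-cell independent chemistry integration, the pattern
// exploited by KPP/KPPA-generated solvers. The 2-species mechanism and the
// implicit-Euler step are toy stand-ins, not the real ART chemistry.
#include <cstdio>
#include <vector>

int main() {
    const int ncells = 114576;          // grid cells, as in the box-model benchmark
    const double dt = 60.0, k1 = 1e-2, k2 = 5e-3;
    std::vector<double> a(ncells, 1.0), b(ncells, 0.0);  // concentrations

    #pragma omp parallel for            // cells are independent: trivial parallelism
    for (int c = 0; c < ncells; ++c) {
        // implicit Euler for  da/dt = -k1*a + k2*b,  db/dt = k1*a - k2*b:
        // solve (I - dt*J) x_new = x_old for the 2x2 system analytically
        double d  = (1.0 + dt*k1) * (1.0 + dt*k2) - dt*dt*k1*k2;
        double an = ((1.0 + dt*k2) * a[c] + dt*k2 * b[c]) / d;
        double bn = (dt*k1 * a[c] + (1.0 + dt*k1) * b[c]) / d;
        a[c] = an; b[c] = bn;
    }
    std::printf("cell 0: a=%.6f b=%.6f\n", a[0], b[0]);
}
```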
9 COSMO-ART: Baseline Energy-to-Solution Benchmark at Cabinet Level
MONCH (CSCS, ETH Zurich):
- 1,040 cores using 20 MPI tasks per node (realistic for production)
- 52 compute nodes were used, each comprising two Intel Xeon Ivy Bridge EP (E5 v2) ten-core processors operating at 2.2 GHz, equipped with 32 GB of DDR3-1600 RAM and connected via InfiniBand Mellanox SX6036 and FDR switches
- This CPU architecture was considered state-of-the-art at the beginning of the Exa2Green project
Power measurement system:
- Model: Chauvin Arnoux PEL103
- Clamp model: MiniFlex MA193
- Precision: ±0.5%
Result: TTS = 1,681.6 s, ETS = 21,182,799 J
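Energy-to-solution here is simply the integral of the measured power over the run. A minimal sketch (assuming a trace of timestamped power samples, e.g. from the PEL103 meter) that integrates it with the trapezoidal rule; the three samples are made up for illustration:

```cpp
// Sketch: energy-to-solution (J) as the time integral of sampled power (W),
// via the trapezoidal rule. A real trace would come from the power meter or
// PMDB at its native sampling rate.
#include <cstddef>
#include <cstdio>
#include <vector>

struct Sample { double t_s; double watts; };   // time in s, power in W

double energy_joules(const std::vector<Sample>& trace) {
    double e = 0.0;
    for (std::size_t i = 1; i < trace.size(); ++i)
        e += 0.5 * (trace[i].watts + trace[i - 1].watts)
                 * (trace[i].t_s - trace[i - 1].t_s);
    return e;
}

int main() {
    std::vector<Sample> trace = {{0.0, 12500.0}, {0.1, 12620.0}, {0.2, 12580.0}};
    std::printf("E = %.1f J\n", energy_joules(trace));
}
```

As a sanity check on the slide's numbers: 21,182,799 J over 1,681.6 s is an average draw of roughly 12.6 kW, i.e. about 240 W per node across the 52 nodes.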
10 COSMO-ART: Standalone KPP Test Framework
D5.2 (M30): Refactored COSMO-ART code prototype for CPUs and multi-core architectures
Two versions for an exclusive benchmarking of the gas-phase chemistry:
- 0-dim box model: identical calculation in all cells of the 3D domain
- extended box model: reads temperature and chemical concentrations from a real run in NetCDF format
Single-node evaluation on a 66x56x31 test domain (114,576 grid cells):
- Piz Daint: Cray XC30 (8-core Intel Xeon Sandy Bridge E5 CPU (2.6 GHz) & Tesla K20X)
- Cray Power Management DataBase (PMDB) + pm_counters sysfs files (updated at 10 Hz)
TTS/ETS reduction factors, relative to the KPP serial baseline (= 1.0x):

| Code variant          | 0-dim box model TTS | 0-dim box model ETS | Extended box model TTS | Extended box model ETS |
|-----------------------|---------------------|---------------------|------------------------|------------------------|
| KPP                   | 1.3x                | 1.4x                | 1.4x                   | 1.4x                   |
| KPP serial (baseline) | 1.0x                | 1.0x                | 1.0x                   | 1.0x                   |
| KPPA serial           | 3.5x                | 5.4x                | 3.4x                   | 5.3x                   |
| KPPA OpenMP           | 25.5x               | 23.5x               | 22.3x                  | 23.3x                  |
| KPPA CUDA             | 33.3x               | 18.8x               | 23.2x                  | …                      |
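A detail worth pulling out of the table: the CUDA version's ETS gain (18.8x) is noticeably smaller than its TTS gain (33.3x). Since E = P_avg * t, the ratio of the two reduction factors is exactly the average-power ratio of the accelerated run versus the baseline. A quick diagnostic sketch using the 0-dim box model numbers from the table above:

```cpp
// Sketch: with E = P_avg * t, the implied average-power ratio of an optimised
// run vs. the baseline is (TTS reduction) / (ETS reduction). Values are the
// 0-dim box model figures from the table above.
#include <cstdio>

int main() {
    struct Row { const char* variant; double tts_red, ets_red; };
    const Row rows[] = {
        {"KPPA serial", 3.5, 5.4},    // draws *less* power than the baseline
        {"KPPA OpenMP", 25.5, 23.5},
        {"KPPA CUDA",   33.3, 18.8},  // much faster, but hungrier with the GPU
    };
    for (const Row& r : rows)
        std::printf("%-12s avg power ratio = %.2f\n",
                    r.variant, r.tts_red / r.ets_red);
}
```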
11 COSMO-ART: Gas-Phase Chemistry Optimisations
Starting point:
- COSMO-ART (ref): initial reference baseline based on COSMO_4.30
Mixed- and single-precision:
- COSMO-ART (sp-dp): mixed-precision version based on COSMO_5.1 (beta) and ART_3.0
- COSMO-ART (sp): single-precision version based on COSMO_5.1 (beta) and ART_3.0
PRACE 2IP WP8 integrator (G. Fanourgakis, J. Lelieveld and D. Taraborelli):
- Time-step control as proposed by Söderlind
- Positivity preservation: artificial preservation of positivity to improve stability
- COSMO-ART (sp, PRACE): based on COSMO_5.1 (beta) and ART_3.0 with positivity preservation and the new time-step control
- COSMO-ART (sp, PRACE, KPPA): same as above, but based on KPPA
Replace COSMO with OPCODE COSMO (HP2C project, with CPU and GPU support); slightly different configuration:
- Requires a revised shallow convection scheme
- The semi-Lagrangian advection scheme (SL3_SC) differs slightly from the original SL3_SFD
- Requires an adapted radiation scheme (roughly the same run-time)
- Results now scientifically validated by H. Vogel (KIT) and J. Charles (CSCS)
- COSMO-ART (sp, PRACE, OPCODE): limited to the Cray compiler (because of the GPU components)
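The two integrator changes are easy to state concretely: Söderlind-style time-step control replaces the classic elementary controller h_new = h * (tol/err)^(1/(p+1)) with a smoother PI (proportional-integral) controller, and positivity is preserved by clipping small negative concentration excursions after each step. A hedged sketch; the gains kI, kP, the clamp limits, and the safety factor are illustrative choices, since the exact controller settings of the PRACE 2IP integrator are not given on the slide:

```cpp
// Sketch of Söderlind-style PI step-size control plus positivity preservation.
// All numeric constants below are illustrative, not the project's settings.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

double pi_controller(double h, double err, double err_prev,
                     double tol, int order) {
    const double kI = 0.3 / (order + 1);    // integral gain (illustrative)
    const double kP = 0.4 / (order + 1);    // proportional gain (illustrative)
    double f = std::pow(tol / err, kI) * std::pow(err_prev / err, kP);
    f = std::clamp(f, 0.2, 5.0);            // limit step-size jumps
    return 0.9 * h * f;                     // 0.9 = conventional safety factor
}

void enforce_positivity(std::vector<double>& conc) {
    // concentrations must stay non-negative; clip solver undershoots
    for (double& c : conc) c = std::max(c, 0.0);
}

int main() {
    double h = pi_controller(60.0, 8e-4, 1e-3, 1e-3, 3);   // err below tol -> grow
    std::vector<double> conc = {1.2e-9, -3.0e-15, 4.7e-8}; // tiny negative excursion
    enforce_positivity(conc);
    std::printf("next dt = %.2f s, conc[1] = %g\n", h, conc[1]);
}
```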
12 COSMO-ART: Preliminary Benchmarking
Proof-of-concept benchmarking on two computing platforms at ETH Zurich - CSCS:
- Piz Daint: Cray XC30, one 8-core Intel Xeon Sandy Bridge E5 CPU (2.6 GHz) per compute node
- Piz Dora: Cray XC40, two 12-core Intel Xeon Haswell E5 v3 CPUs (2.6 GHz) per compute node
- For both: Cray Power Management DataBase (PMDB) + pm_counters sysfs files (updated at 10 Hz)
Remarks:
- 24 h simulation using 288 PEs and the GNU compiler (but Cray -O2 for OPCODE)
- COSMO_5.1 (beta) provided by O. Fuhrer and X. Lapillonne (MeteoSwiss): supports a generic tracer transport mechanism for prognostic variables and allows a flexible definition of new tracers
- ART_3.0 provided by H. & B. Vogel (KIT), with extensive support from them for debugging
13 COSMO-ART: OPCODE, OpenMP, PRACE Integrator (P.I.), Piz Dora; intermediate result
Constant MPI decomposition (192 processes), variable number of nodes and threads:
[Figure: energy-to-solution (J) and time-to-solution (s) for N=8 nodes (#MPI=24 per node; 1,2 threads), N=16 (#MPI=12; 1,2,4 threads), N=24 (#MPI=8; 1,2,6 threads), N=48 (#MPI=4; 1,2,4,6,12 threads) and N=96 (#MPI=2; 1,2,4,6,12,24 threads)]
Bottom line: optimal ETS is obtained on the minimal number of nodes, with each core running one MPI process
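The sweep behind this chart keeps nodes x ranks-per-node fixed at 192 and varies the OpenMP thread count per rank. A small sketch that reproduces the tested configurations and checks how many hardware threads each one occupies; the 48 hardware threads per node assume 2-way SMT on the 24-core Haswell nodes, which is an assumption of this sketch:

```cpp
// Sketch: the hybrid MPI x OpenMP placements swept on this slide. 192 MPI
// ranks total on 24-core Piz Dora nodes (assumed 48 hardware threads with
// 2-way SMT); verify that no configuration oversubscribes a node.
#include <cstdio>
#include <vector>

int main() {
    const int total_ranks = 192, hw_threads_per_node = 48;
    struct Cfg { int ranks_per_node; std::vector<int> threads; };
    const Cfg sweep[] = {
        {24, {1, 2}}, {12, {1, 2, 4}}, {8, {1, 2, 6}},
        {4, {1, 2, 4, 6, 12}}, {2, {1, 2, 4, 6, 12, 24}},
    };
    for (const Cfg& c : sweep) {
        int nodes = total_ranks / c.ranks_per_node;
        for (int t : c.threads)
            std::printf("N=%2d #MPI/node=%2d threads=%2d -> %2d/%d hw threads used\n",
                        nodes, c.ranks_per_node, t,
                        c.ranks_per_node * t, hw_threads_per_node);
    }
}
```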
14 Baseline vs. Final Code Version: OPCODE COSMO-ART, SP, P.I.
- Comparison with 1,040 cores (= MPI processes) in both cases
- Energy: CPU + interconnect + blowers + AC/DC conversion
15 Crosscutting Activities with other teams
16 Performance/Energy-Efficiency Analysis (UHAM/UJI)
D5.1 (M24): Benchmarking report on energy requirements of the current COSMO-ART
TINTORRUM (UJI):
- 16 nodes with 2x Intel Westmere E5645 hex-core CPUs (2.4 GHz) => 192 MPI processes
Power measurement system (UHAM):
- ACP8653 Power Distribution Units (PDUs) with 1 S/s sampling and ±3% accuracy
- High-resolution power-performance tracing framework: Extrae instrumentation library + pmlib tracing server + Paraver
- Visualise and correlate task traces with the power profile
Software environment:
- COSMO-ART baseline (initial model setup)
- OpenMPI 1.6.5: 192 cores using 12 MPI processes per node
Two MPI wait policies (UJI):
- Aggressive: the CPU busy-waits for the incoming message
- Degraded: repeated calls to sched_yield(), letting the OS schedule other work
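The two MPI wait policies differ only in what the polling loop does between probes: spin at full speed (lowest latency, highest power draw) or yield the core back to the OS scheduler (slightly higher latency, lower power). A minimal sketch of the two loops, with a hypothetical message_arrived() predicate standing in for the MPI progress/test call:

```cpp
// Sketch: "aggressive" (busy-wait) vs "degraded" (yielding) wait policies.
// message_arrived() is a hypothetical stand-in for an MPI progress check;
// only the waiting strategy between probes is the point here.
#include <atomic>
#include <cstdio>
#include <thread>
#include <sched.h>                      // POSIX sched_yield()

std::atomic<bool> flag{false};
bool message_arrived() { return flag.load(std::memory_order_acquire); }

void wait_aggressive() {
    while (!message_arrived()) {
        /* spin: lowest latency, core stays at full power */
    }
}

void wait_degraded() {
    while (!message_arrived())
        sched_yield();                  // hand the core back to the OS scheduler
}

int main() {
    std::thread sender([] { flag.store(true, std::memory_order_release); });
    wait_degraded();                    // swap in wait_aggressive() to compare
    sender.join();
    std::puts("message received");
}
```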
17 Impact of WP2 (IBM) Results on the Showcase: KPP on POWER8
Key aspect: the problem is decoupled in space, so each grid point has its own set of ODEs solved with KPP
Considered optimisations:
- High thread parallelism / software optimisation (left panel, baseline)
- Loop merging (center panel)
- Fast exponential, logarithm, and power evaluation for the coefficients of chemical reactions (right panel, IBM-specific)
- Transactional memory (was not applicable)
- Iterative refinement for the linear system, e.g. LU in low precision, residual in high precision; may be pursued in the future
Results:
- Time reduction: -39% with 1 thread per core, -68% with 8 threads per core
- Power increase: +1% with 1 thread per core, +15% with 8 threads per core
- Energy reduction: -30% with 1 thread per core, -58% with 8 threads per core
Unfortunately, even the non-IBM-specific optimisations did not yield a performance improvement on the target Piz Dora Intel Haswell platform
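To make "loop merging" concrete: the rate coefficients of the chemical reactions are typically Arrhenius-type expressions evaluated with exp/log/pow, and separate per-reaction loops can be fused so that each pass over memory does more arithmetic and shares common subexpressions. A hedged sketch with a made-up two-reaction mechanism; IBM's fast math kernels are represented here by plain std::exp and std::pow:

```cpp
// Sketch: loop merging for Arrhenius rate coefficients k = A * T^b * exp(-E/(R*T)).
// The two reactions and their parameters are made up; the point is fusing the
// per-reaction loops into one pass over the cells.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int ncells = 100000;
    const double R = 8.314;                    // gas constant, J/(mol K)
    std::vector<double> T(ncells, 288.0);      // temperature per cell
    std::vector<double> k1(ncells), k2(ncells);

    // Before merging: one loop per reaction => two sweeps over T, k1, k2.
    // After merging: one sweep computes all coefficients for a cell at once,
    // improving locality and giving the compiler a larger vectorisable body.
    for (int c = 0; c < ncells; ++c) {
        const double invRT = 1.0 / (R * T[c]);
        k1[c] = 1.2e-3 * std::pow(T[c], 1.5) * std::exp(-5.0e4 * invRT);
        k2[c] = 4.7e-2 * std::exp(-2.3e4 * invRT);   // shares the 1/(R*T) term
    }
    std::printf("k1[0]=%g k2[0]=%g\n", k1[0], k2[0]);
}
```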
18 Model Reduction of Atmospheric Chemistry Kinetics (UHEI/KIT)
Roadmap point #7: feasibility study
Investigation of popular approaches:
- Removal of species
- Lumping into pseudo-species
- Time-scale separation
- Repro-modelling and functional representation
Assessment of the feasibility within COSMO-ART; focus on repro-modelling: High-Dimensional Model Representation (HDMR)
Implementation and testing of HDMR:
- 0D box model: atmospheric chemistry test problem (Kuhn et al., 1998)
Results:
- HDMR models can be tailored to meet any accuracy requirement, at the price of higher computing demands for their (a-priori) construction and evaluation
- HDMR predictions with acceptable accuracy save up to 99% of the computing time vs. the Rosenbrock solver
Conclusions:
- HDMR offers a promising approach to reduce the time and energy demands of the ART chemical kinetics
- Further investigation is needed to construct optimal HDMR expansions, requiring expert knowledge
Other crosscutting results:
- Investigated the suitability of asynchronous iteration and multigrid methods; COSMO-ART's mathematical properties and problem size were not suitable for these techniques
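For reference, the HDMR (repro-modelling) idea is to expand the expensive input-output map of the chemistry step into a hierarchy of low-dimensional component functions that are constructed a priori and are cheap to evaluate; the expansion is truncated at low order because those terms typically capture most of the variance:

```latex
f(x_1,\dots,x_n) = f_0
  + \sum_{i=1}^{n} f_i(x_i)
  + \sum_{1 \le i < j \le n} f_{ij}(x_i, x_j)
  + \cdots
  + f_{12\dots n}(x_1,\dots,x_n)
```

Here f_0 is the mean response and f_i, f_ij are one- and two-dimensional corrections; evaluating a truncated expansion replaces the stiff Rosenbrock integration, which is the source of the reported up-to-99% saving in computing time.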
19 GPU Results
20 GPU Proofs of Concept
1) Replacement of COSMO by OPCODE COSMO (CPU/GPU-enabled) (4 nodes)
[Figure: COSMO-ART TTS breakdown: Dynamics, Physics, MPI Comm. (Dyn.), MPI Sync. (Dyn.)]
2) Extended box model: utilisation of the (CPU/GPU-enabled) KPPA solvers (single node); TTS/ETS reduction factors as in the table on slide 10
3) Utilisation of CPU/GPU-enabled STELLA for graupel microphysics
21 Bottom Line Summary
Planned: 5x ETS improvement on the full COSMO-ART benchmark
Achieved: 3.3x with OPCODE COSMO and algorithmic improvements, on a typical configuration (1,040 cores on the Piz Dora platform, 44 dual-socket Intel Haswell CPU nodes); a valuable contribution to the atmospheric chemistry community
For GPU platforms, component benchmarks indicate that an additional factor of >1.6x is possible (3.3 x 1.6 ≈ 5.3, i.e. beyond the original 5x target), but:
- the GPU implementation of end-to-end COSMO-ART was not completed (unfortunately for CSCS)
- KPPA had unresolved issues when run in the COSMO-ART context
- software management issues in the merge were more time-consuming than expected
Results:
- The COSMO-ART community benefits immediately from the new code on CPU platforms
- Three exploitable results delivered: STELLA microphysics (CPU/GPU), the box model test framework (CPU/GPU), and OPCODE COSMO-ART SP with the PRACE Integrator (CPU-only)
- The ARM platform was tested with the box model (T5.3); result: GPU architectures are more promising
- Enriching collaborations with Exa2Green partners, e.g. ART development (KIT), power monitoring (UHAM/UJI), box model optimisations (IBM), model reduction (UHEI/KIT)
The full documentation is available on:
More information