Improving the Energy- and Time-to-solution of COSMO-ART
1 Joseph Charles, William Sawyer (ETH Zurich - CSCS), Heike Vogel (KIT), Bernhard Vogel (KIT), Teresa Beck (KIT/UHEI)
COSMO User Workshop, MeteoSwiss, January 18, 2016
2 Summary
3 Main Objectives
Utilise project methodologies to attain a 5x energy-to-solution (ETS) improvement for COSMO-ART:
- Code optimisations / refactoring on CPUs
- System software (other compilers, optimised libraries)
- New algorithms
- New architectures (GPUs, emerging CPUs, ARM)
Technical challenges with a code under constant development:
- Run configuration must be recreated in all subsequent versions
- Results must be reproducible within an expected variance
- Target application: COSMO-HAM (ETH Zurich) or COSMO-ART (KIT, EMPA)?
- Redefinition of the baseline to reflect oversights and newer versions of ART
- Management of different branches, validation, incorporation of versions, e.g. COSMO-4.28, COSMO-4.30, COSMO-5.0, COSMO-5.1_beta, OPCODE COSMO-5.1_beta
- Incongruities / incompatibilities between versions, e.g. OPCODE COSMO was based on COSMO-5.0 and was not upgraded to 5.1 by the end of the project
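The reproducibility requirement ("within an expected variance") is usually enforced by a tolerance-based field comparison rather than bitwise identity, since compiler, precision, and algorithm changes all perturb the bits. A minimal C++ sketch of such a check; the relative-L2 tolerance of 1e-5 is purely illustrative and not a value from the project:

```cpp
// Sketch: tolerance-based validation of a refactored run against a reference.
// Bitwise identity is impossible across compilers/precisions, so fields are
// compared in a relative L2 norm; the 1e-5 threshold is illustrative only.
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

bool within_variance(const std::vector<double>& ref,
                     const std::vector<double>& test, double tol) {
    double diff2 = 0.0, ref2 = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        diff2 += (test[i] - ref[i]) * (test[i] - ref[i]);
        ref2  += ref[i] * ref[i];
    }
    return std::sqrt(diff2) <= tol * std::sqrt(ref2);
}

int main() {
    std::vector<double> ref = {1.0, 2.0, 3.0}, test = {1.0, 2.000001, 3.0};
    std::printf("validated: %s\n", within_variance(ref, test, 1e-5) ? "yes" : "no");
}
```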
4 Main Results
WP5 Roadmap (Mar. 2014):
- Energy profiling of the COSMO-ART baseline (ETH Zurich - CSCS / UHAM / UJI)
- Optimal setup for discretisation parameters, compilers (ETH Zurich - CSCS)
- Refactoring for CPUs (ETH Zurich - CSCS / IBM Research - Zurich)
- ODE solver algorithmic changes (KIT / ETH Zurich - CSCS)
- Mixed-precision COSMO-ART (ETH Zurich - CSCS / KIT)
- Port of COSMO-ART components to accelerators (ETH Zurich - CSCS)
- Feasibility study of a reduced model for gas-phase chemistry (KIT)
- Investigation of possibilities of ART on ARM (UHEI)
Milestones:
- MS10 (M30): Refactored COSMO-ART code prototype for CPUs and multi-core architectures
- MS11 (M36): Performance model for ARM and other emerging hardware
Deliverables:
- D5.1 (M24): Benchmarking report on energy requirements of the current COSMO-ART
- D5.2 (M30): Refactored COSMO-ART code prototype for CPUs and multi-core architectures
- D5.3 (M36): Final delivery of software prototypes, documentation, and summary report
5 Exploitable results
COSMO-ART version optimised with respect to energy-to-solution
- Intellectual Property Rights (IPR): OPCODE COSMO: open source with proprietary background IP. ART: open source with proprietary background IP, available for scientific use after signing an agreement
- Usage scenario: run COSMO-ART in a more cost-effective and energy-efficient manner on applicable hardware platforms
- Sector of application: atmospheric chemistry research
One-moment graupel microphysics standalone C++ code using STELLA
- IPR: open source with proprietary components from the COSMO Consortium
- Usage scenario: assess the potential performance improvement of a COSMO component on multi-core CPU and GPU architectures from a single source code utilising the STELLA framework
- Sector of application: computational science
Box model test framework for the Kinetics PreProcessor (KPP)
- IPR: open source with proprietary background IP; an additional licence for KPPA is needed
- Usage scenario: comparison of an existing KPP implementation in a given application with the same solvers generated by the KPPA proprietary software
- Sector of application: computational chemistry
6 Results Overview
7 COSMO-ART: Atmospheric Chemistry as Showcase
[Figure: runtime breakdown for the reference baseline (GNU compiler, 240 PEs), COSMO (TTS = … s) vs. COSMO-ART (TTS = 4,… s), split into Dynamics, Physics, MPI Comm./Sync. (Dyn. and Phy.), Input, Output, Other, and (for COSMO-ART) ART]
- COSMO: a ubiquitous weather forecast model in Europe, in widespread use at the national weather services of Germany, Switzerland, Italy, Greece, Poland, Romania and Russia, and at a large number of agencies including military and research institutions
- COSMO-ART: COSMO extended for Aerosols and Reactive Trace gases, e.g. for air quality prediction; massive increase in computational expense due to the atmospheric chemistry and the additional tracers to advect (only relatively short simulation times are currently viable)
8 Strategy Overview
- Aerosol Reactive Transport (ART) for atmospheric chemistry
- Optimisations of the time-stepping in solvers generated by the Kinetics PreProcessor (KPP)
- Proprietary KPP version (KPPA) generating multithreaded CPU and CUDA (GPU) code
- CPU/GPU-optimised version of the COSMO NWP model
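Why KPP-generated solvers parallelise so well (cf. the POWER8 results later in the talk): the gas-phase chemistry is decoupled in space, so each grid cell carries its own small stiff ODE system that can be integrated independently of its neighbours. A minimal C++/OpenMP sketch of this pattern, with a made-up two-species mechanism and a plain implicit-Euler step standing in for the Rosenbrock solvers that KPP actually generates (compile with -fopenmp):

```cpp
// Sketch only: per-cell independent chemistry integration, the pattern
// exploited by KPP/KPPA-generated solvers. The 2-species mechanism and the
// implicit-Euler step are toy stand-ins, not the real ART chemistry.
#include <cstdio>
#include <vector>

int main() {
    const int ncells = 114576;          // grid cells, as in the box-model benchmark
    const double dt = 60.0, k1 = 1e-2, k2 = 5e-3;
    std::vector<double> a(ncells, 1.0), b(ncells, 0.0);  // concentrations

    #pragma omp parallel for            // cells are independent: trivial parallelism
    for (int c = 0; c < ncells; ++c) {
        // implicit Euler for  da/dt = -k1*a + k2*b,  db/dt = k1*a - k2*b:
        // solve (I - dt*J) x_new = x_old for the 2x2 system analytically
        double d  = (1.0 + dt*k1) * (1.0 + dt*k2) - dt*dt*k1*k2;
        double an = ((1.0 + dt*k2) * a[c] + dt*k2 * b[c]) / d;
        double bn = (dt*k1 * a[c] + (1.0 + dt*k1) * b[c]) / d;
        a[c] = an; b[c] = bn;
    }
    std::printf("cell 0: a=%.6f b=%.6f\n", a[0], b[0]);
}
```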
9 COSMO-ART: Baseline Energy-to-Solution Benchmark at Cabinet Level
MONCH (CSCS, ETH Zurich):
- 1,040 cores using 20 MPI tasks per node (realistic for production)
- 52 compute nodes were used, each comprising two Intel Xeon Ivy Bridge EP (E5 v2) ten-core processors operating at 2.2 GHz, equipped with 32 GB of DDR3-1600 RAM and connected via InfiniBand Mellanox SX6036 and FDR switches
- This CPU architecture was considered state-of-the-art at the beginning of the Exa2Green project
Power measurement system:
- Model: Chauvin Arnoux PEL103
- Clamp model: MiniFlex MA193
- Precision: ±0.5%
Result: TTS = 1,681.6 s, ETS = 21,182,799 J
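Energy-to-solution here is simply the integral of the measured power over the run. A minimal sketch (assuming a trace of timestamped power samples, e.g. from the PEL103 meter) that integrates it with the trapezoidal rule; the three samples are made up for illustration:

```cpp
// Sketch: energy-to-solution (J) as the time integral of sampled power (W),
// via the trapezoidal rule. A real trace would come from the power meter or
// PMDB at its native sampling rate.
#include <cstddef>
#include <cstdio>
#include <vector>

struct Sample { double t_s; double watts; };   // time in s, power in W

double energy_joules(const std::vector<Sample>& trace) {
    double e = 0.0;
    for (std::size_t i = 1; i < trace.size(); ++i)
        e += 0.5 * (trace[i].watts + trace[i - 1].watts)
                 * (trace[i].t_s - trace[i - 1].t_s);
    return e;
}

int main() {
    std::vector<Sample> trace = {{0.0, 12500.0}, {0.1, 12620.0}, {0.2, 12580.0}};
    std::printf("E = %.1f J\n", energy_joules(trace));
}
```

As a sanity check on the slide's numbers: 21,182,799 J over 1,681.6 s is an average draw of roughly 12.6 kW, i.e. about 240 W per node across the 52 nodes.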
10 COSMO-ART: Standalone KPP Test Framework
D5.2 (M30): Refactored COSMO-ART code prototype for CPUs and multi-core architectures
Two versions for an exclusive benchmarking of the gas-phase chemistry:
- 0-dim box model: identical calculation in all cells of the 3D domain
- extended box model: reads temperature and chemical concentrations from a real run in NetCDF format
Single-node evaluation on a 66x56x31 test domain (114,576 grid cells):
- Piz Daint: Cray XC30 (8-core Intel Xeon Sandy Bridge E5 CPU (2.6 GHz) & Tesla K20X)
- Cray Power Management DataBase (PMDB) + pm_counters sysfs files (updated at 10 Hz)
TTS/ETS reduction factors, relative to the KPP serial baseline (= 1.0x):

| Code variant          | 0-dim box model TTS | 0-dim box model ETS | Extended box model TTS | Extended box model ETS |
|-----------------------|---------------------|---------------------|------------------------|------------------------|
| KPP                   | 1.3x                | 1.4x                | 1.4x                   | 1.4x                   |
| KPP serial (baseline) | 1.0x                | 1.0x                | 1.0x                   | 1.0x                   |
| KPPA serial           | 3.5x                | 5.4x                | 3.4x                   | 5.3x                   |
| KPPA OpenMP           | 25.5x               | 23.5x               | 22.3x                  | 23.3x                  |
| KPPA CUDA             | 33.3x               | 18.8x               | 23.2x                  | …                      |
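A detail worth pulling out of the table: the CUDA version's ETS gain (18.8x) is noticeably smaller than its TTS gain (33.3x). Since E = P_avg * t, the ratio of the two reduction factors is exactly the average-power ratio of the accelerated run versus the baseline. A quick diagnostic sketch using the 0-dim box model numbers from the table above:

```cpp
// Sketch: with E = P_avg * t, the implied average-power ratio of an optimised
// run vs. the baseline is (TTS reduction) / (ETS reduction). Values are the
// 0-dim box model figures from the table above.
#include <cstdio>

int main() {
    struct Row { const char* variant; double tts_red, ets_red; };
    const Row rows[] = {
        {"KPPA serial", 3.5, 5.4},    // draws *less* power than the baseline
        {"KPPA OpenMP", 25.5, 23.5},
        {"KPPA CUDA",   33.3, 18.8},  // much faster, but hungrier with the GPU
    };
    for (const Row& r : rows)
        std::printf("%-12s avg power ratio = %.2f\n",
                    r.variant, r.tts_red / r.ets_red);
}
```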
11 COSMO-ART: Gas-Phase Chemistry Optimisations
Starting point:
- COSMO-ART (ref): initial reference baseline based on COSMO_4.30
Mixed- and single-precision:
- COSMO-ART (sp-dp): mixed-precision version based on COSMO_5.1 (beta) and ART_3.0
- COSMO-ART (sp): single-precision version based on COSMO_5.1 (beta) and ART_3.0
PRACE 2IP WP8 integrator (G. Fanourgakis, J. Lelieveld and D. Taraborelli):
- Time-step control as proposed by Söderlind
- Positivity preservation: artificial preservation of positivity to improve stability
- COSMO-ART (sp, PRACE): based on COSMO_5.1 (beta) and ART_3.0 with positivity preservation and the new time-step control
- COSMO-ART (sp, PRACE, KPPA): same as above, but based on KPPA
Replace COSMO with OPCODE COSMO (HP2C project, with CPU and GPU support); slightly different configuration:
- Requires a revised shallow convection scheme
- The semi-Lagrangian advection scheme (SL3_SC) differs slightly from the original SL3_SFD
- Requires an adapted radiation scheme (roughly the same run-time)
- Results now scientifically validated by H. Vogel (KIT) and J. Charles (CSCS)
- COSMO-ART (sp, PRACE, OPCODE): limited to the Cray compiler (because of the GPU components)
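The two integrator changes are easy to state concretely: Söderlind-style time-step control replaces the classic elementary controller h_new = h * (tol/err)^(1/(p+1)) with a smoother PI (proportional-integral) controller, and positivity is preserved by clipping small negative concentration excursions after each step. A hedged sketch; the gains kI, kP, the clamp limits, and the safety factor are illustrative choices, since the exact controller settings of the PRACE 2IP integrator are not given on the slide:

```cpp
// Sketch of Söderlind-style PI step-size control plus positivity preservation.
// All numeric constants below are illustrative, not the project's settings.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

double pi_controller(double h, double err, double err_prev,
                     double tol, int order) {
    const double kI = 0.3 / (order + 1);    // integral gain (illustrative)
    const double kP = 0.4 / (order + 1);    // proportional gain (illustrative)
    double f = std::pow(tol / err, kI) * std::pow(err_prev / err, kP);
    f = std::clamp(f, 0.2, 5.0);            // limit step-size jumps
    return 0.9 * h * f;                     // 0.9 = conventional safety factor
}

void enforce_positivity(std::vector<double>& conc) {
    // concentrations must stay non-negative; clip solver undershoots
    for (double& c : conc) c = std::max(c, 0.0);
}

int main() {
    double h = pi_controller(60.0, 8e-4, 1e-3, 1e-3, 3);   // err below tol -> grow
    std::vector<double> conc = {1.2e-9, -3.0e-15, 4.7e-8}; // tiny negative excursion
    enforce_positivity(conc);
    std::printf("next dt = %.2f s, conc[1] = %g\n", h, conc[1]);
}
```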
12 COSMO-ART: Preliminary Benchmarking
Proof-of-concept benchmarking on two computing platforms at ETH Zurich - CSCS:
- Piz Daint: Cray XC30, one 8-core Intel Xeon Sandy Bridge E5 CPU (2.6 GHz) per compute node
- Piz Dora: Cray XC40, two 12-core Intel Xeon Haswell E5 v3 CPUs (2.6 GHz) per compute node
- For both: Cray Power Management DataBase (PMDB) + pm_counters sysfs files (updated at 10 Hz)
Remarks:
- 24 h simulation using 288 PEs and the GNU compiler (but Cray -O2 for OPCODE)
- COSMO_5.1 (beta) provided by O. Fuhrer and X. Lapillonne (MeteoSwiss): supports a generic tracer transport mechanism for prognostic variables and allows a flexible definition of new tracers
- ART_3.0 provided by H. & B. Vogel (KIT), with extensive support from them for debugging
13 COSMO-ART: OPCODE, OpenMP, PRACE Integrator (P.I.), Piz Dora; intermediate result
Constant MPI decomposition (192 processes), variable number of nodes and threads:
[Figure: energy-to-solution (J) and time-to-solution (s) for N=8 nodes (#MPI=24 per node; 1,2 threads), N=16 (#MPI=12; 1,2,4 threads), N=24 (#MPI=8; 1,2,6 threads), N=48 (#MPI=4; 1,2,4,6,12 threads) and N=96 (#MPI=2; 1,2,4,6,12,24 threads)]
Bottom line: optimal ETS is obtained on the minimal number of nodes, with each core running one MPI process
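The sweep behind this chart keeps nodes x ranks-per-node fixed at 192 and varies the OpenMP thread count per rank. A small sketch that reproduces the tested configurations and checks how many hardware threads each one occupies; the 48 hardware threads per node assume 2-way SMT on the 24-core Haswell nodes, which is an assumption of this sketch:

```cpp
// Sketch: the hybrid MPI x OpenMP placements swept on this slide. 192 MPI
// ranks total on 24-core Piz Dora nodes (assumed 48 hardware threads with
// 2-way SMT); verify that no configuration oversubscribes a node.
#include <cstdio>
#include <vector>

int main() {
    const int total_ranks = 192, hw_threads_per_node = 48;
    struct Cfg { int ranks_per_node; std::vector<int> threads; };
    const Cfg sweep[] = {
        {24, {1, 2}}, {12, {1, 2, 4}}, {8, {1, 2, 6}},
        {4, {1, 2, 4, 6, 12}}, {2, {1, 2, 4, 6, 12, 24}},
    };
    for (const Cfg& c : sweep) {
        int nodes = total_ranks / c.ranks_per_node;
        for (int t : c.threads)
            std::printf("N=%2d #MPI/node=%2d threads=%2d -> %2d/%d hw threads used\n",
                        nodes, c.ranks_per_node, t,
                        c.ranks_per_node * t, hw_threads_per_node);
    }
}
```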
14 Baseline vs. Final Code Version: OPCODE COSMO-ART, SP, P.I.
- Comparison with 1,040 cores (= MPI processes) in both cases
- Energy: CPU + interconnect + blowers + AC/DC conversion
15 Crosscutting Activities with other teams
16 Performance/Energy-Efficiency Analysis (UHAM/UJI)
D5.1 (M24): Benchmarking report on energy requirements of the current COSMO-ART
TINTORRUM (UJI):
- 16 nodes with 2x Intel Westmere E5645 hex-core CPUs (2.4 GHz) => 192 MPI processes
Power measurement system (UHAM):
- ACP8653 Power Distribution Units (PDUs) with 1 S/s sampling and ±3% accuracy
- High-resolution power-performance tracing framework: Extrae instrumentation library + pmlib tracing server + Paraver
- Visualise and correlate task traces with the power profile
Software environment:
- COSMO-ART baseline (initial model setup)
- OpenMPI 1.6.5: 192 cores using 12 MPI processes per node
Two MPI wait policies (UJI):
- Aggressive: the CPU busy-waits for the incoming message
- Degraded: repeated calls to sched_yield(), letting the OS schedule other work
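The two MPI wait policies differ only in what the polling loop does between probes: spin at full speed (lowest latency, highest power draw) or yield the core back to the OS scheduler (slightly higher latency, lower power). A minimal sketch of the two loops, with a hypothetical message_arrived() predicate standing in for the MPI progress/test call:

```cpp
// Sketch: "aggressive" (busy-wait) vs "degraded" (yielding) wait policies.
// message_arrived() is a hypothetical stand-in for an MPI progress check;
// only the waiting strategy between probes is the point here.
#include <atomic>
#include <cstdio>
#include <thread>
#include <sched.h>                      // POSIX sched_yield()

std::atomic<bool> flag{false};
bool message_arrived() { return flag.load(std::memory_order_acquire); }

void wait_aggressive() {
    while (!message_arrived()) {
        /* spin: lowest latency, core stays at full power */
    }
}

void wait_degraded() {
    while (!message_arrived())
        sched_yield();                  // hand the core back to the OS scheduler
}

int main() {
    std::thread sender([] { flag.store(true, std::memory_order_release); });
    wait_degraded();                    // swap in wait_aggressive() to compare
    sender.join();
    std::puts("message received");
}
```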
17 Impact of WP2 (IBM) Results on the Showcase: KPP on POWER8
Key aspect: the problem is decoupled in space, so each grid point has its own set of ODEs solved with KPP
Considered optimisations:
- High thread parallelism / software optimisation (left panel, baseline)
- Loop merging (center panel)
- Fast exponential, logarithm, and power evaluation for the coefficients of chemical reactions (right panel, IBM-specific)
- Transactional memory (was not applicable)
- Iterative refinement for the linear system, e.g. LU in low precision, residual in high precision; may be pursued in the future
Results:
- Time reduction: -39% with 1 thread per core, -68% with 8 threads per core
- Power increase: +1% with 1 thread per core, +15% with 8 threads per core
- Energy reduction: -30% with 1 thread per core, -58% with 8 threads per core
Unfortunately, even the non-IBM-specific optimisations did not yield a performance improvement on the target Piz Dora Intel Haswell platform
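To make "loop merging" concrete: the rate coefficients of the chemical reactions are typically Arrhenius-type expressions evaluated with exp/log/pow, and separate per-reaction loops can be fused so that each pass over memory does more arithmetic and shares common subexpressions. A hedged sketch with a made-up two-reaction mechanism; IBM's fast math kernels are represented here by plain std::exp and std::pow:

```cpp
// Sketch: loop merging for Arrhenius rate coefficients k = A * T^b * exp(-E/(R*T)).
// The two reactions and their parameters are made up; the point is fusing the
// per-reaction loops into one pass over the cells.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int ncells = 100000;
    const double R = 8.314;                    // gas constant, J/(mol K)
    std::vector<double> T(ncells, 288.0);      // temperature per cell
    std::vector<double> k1(ncells), k2(ncells);

    // Before merging: one loop per reaction => two sweeps over T, k1, k2.
    // After merging: one sweep computes all coefficients for a cell at once,
    // improving locality and giving the compiler a larger vectorisable body.
    for (int c = 0; c < ncells; ++c) {
        const double invRT = 1.0 / (R * T[c]);
        k1[c] = 1.2e-3 * std::pow(T[c], 1.5) * std::exp(-5.0e4 * invRT);
        k2[c] = 4.7e-2 * std::exp(-2.3e4 * invRT);   // shares the 1/(R*T) term
    }
    std::printf("k1[0]=%g k2[0]=%g\n", k1[0], k2[0]);
}
```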
18 Model Reduction of Atmospheric Chemistry Kinetics (UHEI/KIT)
Roadmap point #7: feasibility study
Investigation of popular approaches:
- Removal of species
- Lumping into pseudo-species
- Time-scale separation
- Repro-modelling and functional representation
Assessment of the feasibility within COSMO-ART; focus on repro-modelling: High-Dimensional Model Representation (HDMR)
Implementation and testing of HDMR:
- 0D box model: atmospheric chemistry test problem (Kuhn et al., 1998)
Results:
- HDMR models can be tailored to meet any accuracy requirement, at the price of higher computing demands for their (a-priori) construction and evaluation
- HDMR predictions with acceptable accuracy save up to 99% of the computing time vs. the Rosenbrock solver
Conclusions:
- HDMR offers a promising approach to reduce the time and energy demands of the ART chemical kinetics
- Further investigation is needed to construct optimal HDMR expansions, requiring expert knowledge
Other crosscutting results:
- Investigated the suitability of asynchronous iteration and multigrid methods; COSMO-ART's mathematical properties and problem size were not suitable for these techniques
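For reference, the HDMR (repro-modelling) idea is to expand the expensive input-output map of the chemistry step into a hierarchy of low-dimensional component functions that are constructed a priori and are cheap to evaluate; the expansion is truncated at low order because those terms typically capture most of the variance:

```latex
f(x_1,\dots,x_n) = f_0
  + \sum_{i=1}^{n} f_i(x_i)
  + \sum_{1 \le i < j \le n} f_{ij}(x_i, x_j)
  + \cdots
  + f_{12\dots n}(x_1,\dots,x_n)
```

Here f_0 is the mean response and f_i, f_ij are one- and two-dimensional corrections; evaluating a truncated expansion replaces the stiff Rosenbrock integration, which is the source of the reported up-to-99% saving in computing time.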
19 GPU Results
20 GPU Proofs of Concept
1) Replacement of COSMO by OPCODE COSMO (CPU/GPU-enabled) (4 nodes)
[Figure: COSMO-ART TTS breakdown: Dynamics, Physics, MPI Comm. (Dyn.), MPI Sync. (Dyn.)]
2) Extended box model: utilisation of the (CPU/GPU-enabled) KPPA solvers (single node); TTS/ETS reduction factors as in the table on slide 10
3) Utilisation of CPU/GPU-enabled STELLA for graupel microphysics
21 Bottom Line Summary
Planned: 5x ETS improvement on the full COSMO-ART benchmark
Achieved: 3.3x with OPCODE COSMO and algorithmic improvements, on a typical configuration (1,040 cores on the Piz Dora platform, 44 dual-socket Intel Haswell CPU nodes); a valuable contribution to the atmospheric chemistry community
For GPU platforms, component benchmarks indicate that an additional factor of >1.6x is possible (3.3 x 1.6 ≈ 5.3, i.e. beyond the original 5x target), but:
- the GPU implementation of end-to-end COSMO-ART was not completed (unfortunately for CSCS)
- KPPA had unresolved issues when run in the COSMO-ART context
- software management issues in the merge were more time-consuming than expected
Results:
- The COSMO-ART community benefits immediately from the new code on CPU platforms
- Three exploitable results delivered: STELLA microphysics (CPU/GPU), the box model test framework (CPU/GPU), and OPCODE COSMO-ART SP with the PRACE Integrator (CPU-only)
- The ARM platform was tested with the box model (T5.3); result: GPU architectures are more promising
- Enriching collaborations with Exa2Green partners, e.g. ART development (KIT), power monitoring (UHAM/UJI), box model optimisations (IBM), model reduction (UHEI/KIT)
The full documentation is available on:
More information