Optimizing an Earth Science Atmospheric Application with the OmpSs Programming Model


www.bsc.es
Optimizing an Earth Science Atmospheric Application with the OmpSs Programming Model
HPC Knowledge Meeting '15
George S. Markomanolis, Jesus Labarta, Oriol Jorba
University of Barcelona, Barcelona, 3 February 2015

Outline: Introduction to NMMB/BSC-CTM; Performance overview of NMMB/BSC-CTM; Experiences with OmpSs; Future work

Severo-Ochoa Earth Sciences Application: development of a unified Meteorology/Air Quality/Climate model, moving towards a global high-resolution system for global-to-local assessments. This means extending NMMB/BSC-CTM from coarse regional scales to global high-resolution configurations and coupling it with a data assimilation system for aerosols. International collaborations: meteorology with the National Centers for Environmental Prediction, climate and global aerosols with the Goddard Institute for Space Studies, air quality with the University of California, Irvine.

Where do we solve the primitive equations? On a discretized grid, and the grid determines the high performance computing resources needed: resolving small-scale features requires a finer mesh, and therefore more HPC resources.

Parallelizing Atmospheric Models: we need to be able to run these models on multi-core architectures. The model domain is decomposed into patches; a patch is the portion of the model domain allocated to a distributed/shared-memory node, and it communicates with its neighbouring patches via MPI.
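As a loose illustration only (none of this is NMMB/BSC-CTM code), a patch decomposition with neighbour communication can be organised around a Cartesian MPI communicator; the grid shape and the halo exchange itself are left as placeholders here:

#include <mpi.h>

/* Minimal sketch: a 2D patch decomposition in which each MPI rank owns
   one patch and identifies its four neighbours. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs;
    int dims[2] = {0, 0}, periods[2] = {0, 0};
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);           /* e.g. 128 ranks -> a 16 x 8 grid */

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

    int west, east, south, north;
    MPI_Cart_shift(cart, 0, 1, &west, &east);   /* neighbours along the first dimension */
    MPI_Cart_shift(cart, 1, 1, &south, &north); /* neighbours along the second dimension */

    /* Each rank would post halo exchanges (MPI_Isend/MPI_Irecv/MPI_Wait)
       with west/east/south/north for the boundary rows and columns of its patch. */

    MPI_Finalize();
    return 0;
}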

NMMB/BSC-CTM is used operationally by the dust forecast center in Barcelona, and NMMB is the operational model of NCEP. The overall goal is to improve its scalability and the simulation resolution.

MareNostrum III: 3,056 compute nodes, 2x Intel SandyBridge-EP E5-2670 at 2.6 GHz, 32 GB memory per node, InfiniBand FDR10, OpenMPI 1.8.1, ifort 13.0.1

Performance Overview of NMMB/BSC-CTM Model

Execution diagram: focus on the Model.

Paraver: one-hour simulation of NMMB/BSC-CTM, global, 24 km, 64 layers. meteo: 9 tracers; meteo + aerosols: 9 + 16 tracers; meteo + aerosols + gases: 9 + 16 + 53 tracers.

Dynamic load balancing: different simulation dates cause different load-balance issues (useful functions, global 24 km meteo configuration): 20/12/2005 vs. 20/07/2005.

A "different" view point: parallel efficiency is decomposed as Eff = LB × Ser × Trf. (The slide shows a table of LB, Ser, Trf and Eff values per configuration: 0.83, 0.97, 0.80, 0.87, 0.9, 0.78, 0.88, 0.97, 0.84, 0.73, 0.88, 0.96, 0.75, 0.61.)
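For reference, the decomposition reads as follows; the numbers in the worked example are purely illustrative and not taken from the table above:

\[
\mathrm{Eff} = \mathrm{LB} \times \mathrm{Ser} \times \mathrm{Trf},
\qquad \text{e.g.}\ 0.85 \times 0.95 \times 0.80 \approx 0.65
\]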

One hour simulation: chemistry configuration for the global model, 24 km resolution.

One hour simulation: useful functions view.

Zoom on the EBI solver and subsequent calls: the EBI solver (run_ebi) and the calls up to the next invocation of the solver.

Zoom between EBI solvers: the useful-function calls between two EBI solvers. The first two dark blue areas are horizontal diffusion calls, and the lighter blue area is advection chemistry.

Horizontal diffusion: we zoom in on horizontal diffusion and the calls that follow it. Horizontal diffusion (blue colour) shows load imbalance.

Experiences with OmpSs

Objectives: apply OmpSs to a real application, follow an incremental methodology, identify opportunities, and explore the difficulties.

OmpSs introduction: a parallel programming model that builds on an existing standard (OpenMP). It is directive-based, so a serial version of the code is preserved; it targets SMP, clusters, and accelerator devices; and it is developed at the Barcelona Supercomputing Center (BSC). It consists of the Mercurium source-to-source compiler and the Nanos++ runtime system. https://pm.bsc.es/ompss
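As a minimal sketch of the directive style (in C for brevity; the model code itself is Fortran, where OmpSs offers the same clauses), two tasks linked by an in/out dependency:

#include <stdio.h>

enum { N = 1024 };
static double v[N];      /* produced by one task, consumed by another */
static double sum;       /* result of the consumer task */

static void produce(double *x, int n)
{
    for (int i = 0; i < n; i++) x[i] = 0.5 * i;
}

static void consume(const double *x, int n, double *s)
{
    *s = 0.0;
    for (int i = 0; i < n; i++) *s += x[i];
}

int main(void)
{
    #pragma omp task out(v)              /* producer task */
    produce(v, N);

    #pragma omp task in(v) out(sum)      /* runs only after the producer completes */
    consume(v, N, &sum);

    #pragma omp taskwait                 /* wait for all tasks before reading sum */
    printf("sum = %f\n", sum);
    return 0;
}

The Nanos++ runtime derives the execution order from the declared in/out regions, so no explicit synchronisation is needed between the two tasks.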

Study cases: taskify a computation routine and investigate potential improvements and overlap between computations of different granularities; overlap communication with packing and unpacking costs; overlap independent coarse-grain operations composed of communication and computation phases.

Horizontal diffusion + communication: the horizontal diffusion has some load imbalance (blue colour). There is also some computation for packing/unpacking data into the communication buffers (red area), plus the gather (green colour) and scatter for the FFTs.

Horizontal diffusion skeleton code: the hdiff subroutine contains the following loops and dependencies.
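A hedged sketch of what taskifying such loops can look like; the loop bodies and names below are hypothetical, not the actual hdiff code:

/* Hypothetical diffusion-like routine: each major loop nest becomes a task,
   and the in/out clauses encode the dependency between the two loops. */
void hdiff_sketch(double *fld, double *tmp, int nx, int ny)
{
    #pragma omp task in([nx*ny]fld) out([nx*ny]tmp)       /* loop 1 */
    for (int j = 1; j < ny - 1; j++)
        for (int i = 1; i < nx - 1; i++)
            tmp[j*nx + i] = 0.25 * (fld[j*nx + i - 1] + fld[j*nx + i + 1]
                                  + fld[(j-1)*nx + i] + fld[(j+1)*nx + i]);

    #pragma omp task in([nx*ny]tmp) inout([nx*ny]fld)     /* loop 2, runs after loop 1 */
    for (int j = 1; j < ny - 1; j++)
        for (int i = 1; i < nx - 1; i++)
            fld[j*nx + i] += 0.1 * (tmp[j*nx + i] - fld[j*nx + i]);

    #pragma omp taskwait
}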

Parallelizing loops: part of hdiff with 2 threads, parallelizing the most important loops. We obtain a speedup of 1.3 by using worksharing.

Comparison: the execution of the hdiff subroutine takes 120 ms with 1 thread and 56 ms with 2 threads, a speedup of 2.14.

Issues related to communication: we study the exch4 subroutine (red colour). The useful function of exch4 contains some computation, the communication creates a pattern, and the duration of the MPI_Wait calls can vary.

Issues related to communication: big load imbalance because of message ordering; there is also some computation.

Taskify subroutine exch4: we observe the MPI_Wait calls on the first thread, while at the same moment the second thread performs the necessary computation, overlapping the communication.
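A hedged sketch of the idea (not the real exch4, which is Fortran): the blocking wait is wrapped in one task while independent computation runs as another task, so a second thread can compute while the first one sits in MPI_Wait. This assumes the MPI library was initialised with a thread support level that allows it.

#include <mpi.h>

/* halo_buf is being filled by a pending non-blocking receive tracked by req;
   work on the interior does not touch halo_buf and can overlap the wait. */
void exch_sketch(double *halo_buf, int halo_n, MPI_Request *req,
                 double *interior, int interior_n)
{
    #pragma omp task inout([halo_n]halo_buf)      /* completes the halo receive */
    MPI_Wait(req, MPI_STATUS_IGNORE);

    #pragma omp task inout([interior_n]interior)  /* independent computation */
    for (int i = 0; i < interior_n; i++)
        interior[i] *= 0.5;

    #pragma omp taskwait   /* both must finish before the halo values are used */
}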

Taskify subroutine exch4: comparing the total execution of the exch4 subroutine with 1 thread and with 2 threads, the speedup with 2 threads is 1.76 (further improvements have been identified).

Advection chemistry and FFT: advection chemistry (blue colour) and 54 calls to gather/fft/scatter until monotonization chemistry (brown colour). This is an initial study to test the improvements from executing with the OmpSs programming model.

Study case: gather/fft/scatter. Workflow of two iterations, using two threads, with the dependencies declared.
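A hedged sketch of how such a pipeline can be expressed (gather_tr, fft_tr and scatter_tr are placeholder routines, not the model's actual interfaces): the three tasks of one tracer are serialised through their dependency on that tracer's slab, while different tracers stay independent and can be picked up by different threads.

/* Placeholder prototypes for the per-tracer stages. */
void gather_tr(double *slab, int n);
void fft_tr(double *slab, int n);
void scatter_tr(double *slab, int n);

void fft_filter_all(double **slab, int ntracers, int slab_n)
{
    for (int t = 0; t < ntracers; t++) {
        double *s = slab[t];   /* each tracer owns an independent slab */

        #pragma omp task inout([slab_n]s)   /* gather tracer t */
        gather_tr(s, slab_n);

        #pragma omp task inout([slab_n]s)   /* FFT filter for tracer t */
        fft_tr(s, slab_n);

        #pragma omp task inout([slab_n]s)   /* scatter tracer t back */
        scatter_tr(s, slab_n);
    }
    #pragma omp taskwait
}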

Study case: gather/fft/scatter. Paraver view, two iterations, four tracers in total. Thread 1, iteration 1: FFT_1, scatter_1, scatter_2; iteration 2: gather_4, FFT_4, scatter_4. Thread 2, iteration 1: gather_1, gather_2, FFT_2; iteration 2: gather_3, FFT_3, scatter_3.

Study case: gather/fft/scatter, performance. Comparing the execution time of the 54 calls to gather/fft/scatter with one and two threads, the speedup with two threads is 1.56, and we have identified further potential improvements.

MPI bandwidth: the MPI bandwidth for gather/scatter is over 1 GB/s.

Combination of advection chemistry and FFT: advection chemistry with worksharing (though not for all the loops) plus the FFTs, compared between one thread and two threads. The speedup for the advection chemistry routine is 1.6 and the overall speedup is 1.58.

Comparison between MPI and MPI+OmpSs: pure MPI uses 128 computation processes plus 4 I/O processes; MPI+OmpSs uses 64 MPI processes + 64 threads + 4 I/O processes. The load imbalance for the FFT is 28% with pure MPI and 54% with MPI+OmpSs.

Incremental methodology with OmpSs: taskify the loops; start with 1 thread and use if(0) to serialize the tasks; test that the dependencies are correct (usually by trial and error; imagine an application crashing after adding 20+ new pragmas, a true story); and do not parallelize loops that do not contain significant computation.
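A small sketch of that if(0) trick (routine names are placeholders): with if(0), each task is executed immediately by the creating thread, so the task version runs serially and its results can be checked against the original code before the clause is removed.

void init_field(double *a, int n);              /* placeholder routines */
void smooth_field(const double *a, double *b, int n);

void step_sketch(double *a, double *b, int n)
{
    #pragma omp task out([n]a) if(0)            /* runs inline while debugging */
    init_field(a, n);

    #pragma omp task in([n]a) out([n]b) if(0)   /* drop if(0) once dependencies are trusted */
    smooth_field(a, b, n);

    #pragma omp taskwait
}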

Conclusions: the incremental methodology is important to keep the overhead added to the application low. OmpSs can be applied to a real application, but it is not straightforward. It can achieve quite good speedups, depending on the case. Overlapping communication with computation is a really interesting topic. We are still at the beginning, but OmpSs seems promising.

Future improvements: investigate the usage of multithreaded MPI. One of the main functions of the application is the EBI solver (run_ebi); a problem with global variables makes this function non-reentrant, so refactoring of the code is needed. Port more code to OmpSs and investigate MPI calls as tasks. Some computation is independent of the model's layers or of the tracers; OpenCL kernels are going to be developed to test the performance on accelerators. Test the versioning scheduler. The dynamic load balancing library should be studied further (http://pm.bsc.es/dlb).
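For the multithreaded-MPI direction, a minimal sketch of what would be requested at start-up (illustrative only):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for full multithreaded support so that tasks running on any
       thread may issue MPI calls; check what the library actually grants. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "MPI provides thread level %d; multithreaded MPI calls are not safe\n",
                provided);

    /* ... application would run here ... */

    MPI_Finalize();
    return 0;
}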

www.bsc.es Thank you! For further information please contact georgios.markomanolis@bsc.es. "Work funded by the SEV-2011-00067 grant of the Severo Ochoa Program, awarded by the Spanish Government." Acknowledgements: Rosa M. Badia, Judit Gimenez, Roger Ferrer Ibáñez, Julian Morillo, Victor López, Xavier Teruel, Harald Servat, BSC support.