Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation

Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation

Michael Lange (1), Gerard Gorman (1), Michele Weiland (2), Lawrence Mitchell (2), Xiaohu Guo (3), James Southern (4)

(1) AMCG, Imperial College London
(2) EPCC, University of Edinburgh
(3) STFC, Daresbury Laboratory
(4) Fujitsu Laboratories of Europe Ltd.

9 July 2013

Motivation

Fluidity:
- Unstructured finite element code
- Anisotropic mesh adaptivity
- Applications: CFD, geophysical flows, ocean modelling, reservoir modelling, mining, nuclear safety, renewable energies, etc.

PETSc:
- Linear solver engine
- Hybrid MPI/OpenMP version

Programming for Exascale

Three levels of parallelism in modern HPC architectures [1]:
- Between nodes: message passing via MPI
- Between cores: shared-memory communication
- Within the core: SIMD

Hybrid MPI/OpenMP parallelism (see the sketch after this list):
- Memory argument: the MPI memory footprint is not scalable; halo data is replicated per rank
- Speed argument: message-passing overhead; improved load balance with fewer MPI ranks

[1] A. D. Robison and R. E. Johnson. Three layer cake for shared-memory programming. In Proceedings of the 2010 Workshop on Parallel Programming Patterns, ParaPLoP '10, pages 5:1-5:8, New York, NY, USA, 2010. ACM.
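As a minimal sketch of the hybrid model (not taken from the slides), the snippet below requests a thread-capable MPI environment and spawns an OpenMP team inside each rank; the thread-support level and the printed diagnostics are illustrative assumptions only.

```c
/* Minimal hybrid MPI/OpenMP sketch: MPI between nodes, OpenMP within a node. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;

    /* FUNNELED: only the master thread of each rank makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each MPI rank spawns a team of OpenMP threads for on-node work. */
    #pragma omp parallel
    {
        #pragma omp master
        printf("rank %d of %d: %d OpenMP threads\n",
               rank, size, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```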

PETSc Overview

Matrix and Vector classes are used by all other components. OpenMP threading was added to the low-level implementations (a sketch of the threading pattern follows this list):
- Vector operations
- CSR matrices
- Block-CSR matrices
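As a hedged illustration of what threading such a low-level kernel looks like, the helper below parallelises an axpy-style vector update with OpenMP. The function name is hypothetical; it is not PETSc's internal VecAXPY implementation.

```c
/* Illustrative OpenMP-threaded vector kernel (hypothetical helper). */
void vec_axpy_threaded(long n, double alpha, const double *x, double *y)
{
    /* Static scheduling keeps each thread on a contiguous block of the
     * vector, which is friendly to NUMA first-touch data placement. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```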

Sparse Matrix-Vector Multiplication

Matrix-multiply is the most expensive component of the solve.

[Figure: sparse matrix partitioned row-wise across processes P1-P8, showing the diagonal and off-diagonal submatrices owned by each process.]

Parallel matrix-multiply (see the sketch below):
- Multiply the diagonal submatrix
- Scatter/gather remote vector elements
- Multiply-add the off-diagonal submatrices
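A minimal sketch of this structure is shown below. The types and helpers (Csr, Halo, csr_spmv, halo_exchange_*) are hypothetical placeholders; PETSc's actual MPIAIJ multiply follows the same begin-scatter / local multiply / end-scatter / multiply-add shape.

```c
/* Distributed SpMV pattern: overlap the halo exchange with the local multiply. */
typedef struct { int m; const int *rowptr, *col; const double *val; } Csr;
typedef struct Halo Halo;                      /* opaque communication object */

void          halo_exchange_begin(Halo *h, const double *xlocal); /* post sends/recvs */
const double *halo_exchange_end(Halo *h);                          /* wait, return ghosts */

static void csr_spmv(const Csr *A, const double *x, double *y, int add)
{
    for (int i = 0; i < A->m; i++) {
        double sum = add ? y[i] : 0.0;
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
            sum += A->val[k] * x[A->col[k]];
        y[i] = sum;
    }
}

void dist_spmv(const Csr *Ad, const Csr *Ao, const double *xlocal,
               double *y, Halo *halo)
{
    halo_exchange_begin(halo, xlocal);   /* 1. start scatter of remote vector entries */
    csr_spmv(Ad, xlocal, y, 0);          /* 2. overlap: diagonal block uses local data */
    const double *xghost = halo_exchange_end(halo);
    csr_spmv(Ao, xghost, y, 1);          /* 3. multiply-add the off-diagonal block     */
}
```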

Sparse Matrix-Vector Multiplication (continued)

- Input vector elements require MPI communication
- Hide MPI latency by overlapping it with local computation
- Not all MPI implementations progress messages asynchronously [2]

Task-based matrix-multiply, in contrast to vector-based threading (see the sketch below):
- Dedicated thread for MPI communication: advances the communication protocol and copies data to/from buffers
- The parallel section is lifted to include the scatter/gather operation, so the parallel for pragma cannot be used
- N - 1 compute threads are enough to saturate memory bandwidth

[2] G. Schubert, H. Fehske, G. Hager, and G. Wellein. Hybrid-parallel sparse matrix-vector multiplication with explicit communication overlap on current multicore-based systems. Parallel Processing Letters, 21(3):339-358, 2011.
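The OpenMP pattern below sketches this task-based variant under the same assumptions as the previous snippet (Csr and Halo placeholders); RowRange, csr_spmv_rows and the extra halo helpers are likewise hypothetical names, not the authors' code.

```c
/* Task-based SpMV sketch: thread 0 drives MPI, the other N-1 threads compute. */
#include <omp.h>

typedef struct { int start, end; } RowRange;    /* per-thread row block [start, end) */

void          halo_exchange_progress_and_end(Halo *h);
const double *halo_ghost_values(const Halo *h);
void          csr_spmv_rows(const Csr *A, const double *x, double *y,
                            int r0, int r1, int add);

void dist_spmv_taskbased(const Csr *Ad, const Csr *Ao, const double *xlocal,
                         double *y, Halo *halo, const RowRange *rows)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();

        if (tid == 0) {
            /* Communication thread: post, progress and complete the halo
             * exchange (buffer copies happen here). */
            halo_exchange_begin(halo, xlocal);
            halo_exchange_progress_and_end(halo);
        } else {
            /* Compute threads: each multiplies its pre-assigned row block
             * of the diagonal submatrix (no 'parallel for' is used). */
            csr_spmv_rows(Ad, xlocal, y, rows[tid].start, rows[tid].end, 0);
        }

        /* All threads wait until the ghost values are available ... */
        #pragma omp barrier

        /* ... then the compute threads handle the off-diagonal block. */
        if (tid > 0)
            csr_spmv_rows(Ao, halo_ghost_values(halo), y,
                          rows[tid].start, rows[tid].end, 1);
    }
}
```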

Sparse Matrix-Vector Multiplication: Thread-Level Load Balance

- Matrix rows are partitioned into blocks per thread
- The partitioning is created from the number of non-zero elements per row [3] and cached with the matrix object
- Explicit thread-balancing scheme: initial greedy allocation followed by a local diffusion algorithm (see the sketch below)

[3] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing, 35(3):178-194, 2009.
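The sketch below (reusing the RowRange type from the previous snippet) shows a simple greedy split of contiguous row blocks by cumulative non-zero count. It illustrates the idea only; the authors' scheme additionally refines the allocation with a local diffusion step, which is omitted here.

```c
/* Assign contiguous row blocks so each thread gets roughly nnz/nthreads
 * non-zeros (simple greedy split by CSR prefix sums; illustration only). */
void partition_rows_by_nnz(int m, const int *rowptr, int nthreads,
                           RowRange *rows)
{
    long long nnz = rowptr[m];
    int row = 0;

    for (int t = 0; t < nthreads; t++) {
        long long target = nnz * (long long)(t + 1) / nthreads;  /* cumulative target */
        rows[t].start = row;
        /* Take rows while the cumulative non-zero count stays within the
         * target; the last thread takes everything that is left. */
        while (row < m && (t == nthreads - 1 || rowptr[row + 1] <= target))
            row++;
        rows[t].end = row;                                        /* exclusive end */
    }
}
```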

Benchmark

Global baroclinic ocean simulation [4]:
- Mesh based on extruded bathymetry data
- Pressure matrix: 371,102,769 non-zero elements, 13,491,933 degrees of freedom

Solver options (see the configuration sketch below):
- Conjugate Gradient method
- Jacobi preconditioner
- 10,000 iterations

[4] M. D. Piggott, G. J. Gorman, C. C. Pain, P. A. Allison, A. S. Candy, B. T. Martin, and M. R. Wells. A new computational framework for multi-scale ocean modelling based on adapting unstructured meshes. International Journal for Numerical Methods in Fluids, 56(8):1003-1015, 2008.
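For reference, a solver configuration of this kind maps onto standard PETSc KSP calls as sketched below. The slides do not show the actual configuration code, so this is an assumption based on the stated settings (matrix and vector setup omitted).

```c
/* Sketch: configure a PETSc KSP to match the stated benchmark settings
 * (CG, Jacobi preconditioner, up to 10,000 iterations). Assumed mapping,
 * not the authors' actual setup code. */
#include <petscksp.h>

PetscErrorCode configure_solver(KSP ksp)
{
    PC             pc;
    PetscErrorCode ierr;

    ierr = KSPSetType(ksp, KSPCG);CHKERRQ(ierr);        /* Conjugate Gradient   */
    ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
    ierr = PCSetType(pc, PCJACOBI);CHKERRQ(ierr);        /* Jacobi preconditioner */
    ierr = KSPSetTolerances(ksp, PETSC_DEFAULT, PETSC_DEFAULT,
                            PETSC_DEFAULT, 10000);CHKERRQ(ierr); /* max iterations */
    ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);         /* allow runtime overrides */
    return 0;
}
```

Roughly the same configuration can be selected at run time with the standard PETSc options -ksp_type cg -pc_type jacobi -ksp_max_it 10000.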

Architecture Overview

Cray XE6 (HECToR):
- NUMA architecture
- 32 cores per node (4 NUMA domains, 8 cores each)

Fujitsu PRIMEHPC FX10:
- UMA architecture
- 16 cores per node

IBM BlueGene/Q:
- UMA architecture
- 16 cores per node, 4-way hardware threading (SMT)

Hardware Utilisation: 128 cores

[Figure: runtime (s) vs. number of threads per MPI process (1-32) for XE6 and FX10, each with Vector, Task, and Task/NZ-balanced variants.]

- On the XE6, a slowdown occurs when a process spans multiple NUMA domains
- Performance is bound by memory latency

Hardware Utilisation: 1024 cores

[Figure: runtime (s) vs. number of threads per MPI process (1-32) for XE6 and FX10, each with Vector, Task, and Task/NZ-balanced variants.]

- Both task-based algorithms improve performance
- NZ-based load balancing is now faster

Hardware Utilisation: 4096 cores

[Figure: runtime (s) vs. number of threads per MPI process (1-32) on the XE6 for Vector, Task, and Task/NZ-balanced variants.]

- The vector-based approach is bound by MPI communication
- Explicit thread-balancing improves memory bandwidth utilisation, but worsens latency effects

Strong Scaling: Cray XE6

[Figure: runtime (s, log scale) and parallel efficiency (%) vs. number of cores (32-8192) for Vector-based, Task-based, Task-based/NZ-balanced, and Pure-MPI variants.]

Strong Scaling: Cray XE6

[Figure: runtime (s, log scale) and parallel efficiency (%) vs. number of cores (256-32768) for Vector-based, Task-based, Task-based/NZ-balanced, and Pure-MPI variants.]

Strong Scaling: BlueGene/Q

[Figure: runtime (s, log scale) and parallel efficiency (%) vs. number of cores (128-8192) for Pure-MPI and, with SMT=4, Vector-based, Task-based, and Task-based/NZ-balanced variants.]

Conclusion

OpenMP-threaded PETSc version:
- Threaded vector and matrix operators
- Task-based sparse matrix multiplication
- Non-zero-based thread partitioning

Strong scaling optimisation:
- Performance deficit on small numbers of nodes (latency-bound)
- Increased performance in the strong-scaling limit (bandwidth-bound)

Marshalling load imbalance:
- Inter-process balance improved with fewer MPI ranks
- Load imbalance among threads handled explicitly

Acknowledgements

The threaded PETSc version is available from the Open Petascale Libraries: http://www.openpetascale.org/

The work presented here was funded by:
- Fujitsu Laboratories of Europe Ltd.
- The European Commission in FP7 as part of the APOS-EU project

Many thanks to:
- EPCC
- The Hartree Centre
- The PETSc development team