MPI+X on The Way to Exascale. William Gropp


MPI+X on The Way to Exascale William Gropp http://wgropp.cs.illinois.edu

Likely Exascale Architectures
[Figure 2.1: Abstract Machine Model of an Exascale Node Architecture: fat cores plus thin cores/accelerators, 3D stacked memory (low capacity, high bandwidth), DRAM/NVRAM (high capacity, low bandwidth), an integrated NIC for off-chip communication, and multiple coherence domains. Note: not fully cache coherent.]
From "Abstract Machine Models and Proxy Architectures for Exascale Computing," Rev 1.1, J. Ang et al.

Another Pre-Exascale Architecture: Sunway TaihuLight, with heterogeneous processors (MPE, CPE) and no data cache.

MPI (The Standard) Can Scale Beyond Exascale
- MPI implementations already support more than 1M processes
  - Several systems (including Blue Waters) have over 0.5M independent cores
- Many exascale designs have a similar number of nodes as today's systems
  - MPI as the internode programming system seems likely
- There are challenges:
  - Connection management
  - Buffer management
  - Memory footprint
  - Fast collective operations
  - And no implementation is as good as it needs to be, but
- There are no intractable problems here: MPI implementations can be engineered to support exascale systems, even in the MPI-everywhere approach

Applications Still Mostly MPI-Everywhere
- "the larger jobs (> 4096 nodes) mostly use message passing with no threading." BW Workload study, https://arxiv.org/ftp/arxiv/papers/1703/1703.00924.pdf
- Benefit of programmer-managed locality
  - Memory performance is nearly stagnant; parallelism for performance implies locality must be managed effectively
- Benefit of a single programming system
  - Often stated as desirable, but with little evidence
  - Common to mix Fortran, C, Python, etc.
- But the interfaces between systems must work well, and often don't
  - E.g., for MPI+OpenMP, who manages the cores and how is that negotiated?

Why Do Anything Else?
- Performance
  - May avoid memory copies (though usually not cache copies)
  - Easier load balance: shift work among cores with shared memory
  - More efficient fine-grain algorithms: load/store rather than routine calls
  - Option for algorithms that include races (asynchronous iteration, ILU approximations)
- Adapt to modern node architecture

SMP Nodes: One Model
[Diagram: two multicore SMP nodes, each with its own NIC, connected through the network.]

Classic Performance Model
- T = s + r·n
- Sometimes called the postal model
- The model combines overhead and network latency (s) with a single communication rate 1/r for n bytes of data
- A good fit to machines when it was introduced, but does it match modern SMP-based machines?
- Let's look at the communication rate per process with processes communicating between two nodes

Rates Per Process
[Plots: per-process bandwidth for ping-pong between 2 nodes using 1-16 cores on each node; top is BG/Q, bottom Cray XE6.]
- The classic model predicts a single curve: rates independent of the number of communicating processes

Why this Behavior?
- The T = s + r·n model predicts the same performance independent of the number of communicating processes
- What is going on?
- How should we model the time for communication?

SMP Nodes: One Model
[Diagram repeated: two multicore SMP nodes, each with its own NIC, connected through the network.]

Modeling the Communication
- Each link can support a rate r_L of data
- Data is pipelined (LogP model); a store-and-forward analysis is different
- Overhead is completely parallel: k processes sending one short message each take the same time as one process sending one short message

A Slightly Better Model
- Assume that the sustained communication rate is limited by:
  - The maximum rate along any shared link (the link between NICs)
  - The aggregate rate along parallel links (each of the links from an MPI process to/from the NIC)
[Plot: aggregate bandwidth vs. number of communicating processes (1-16) for message sizes n = 256K, 512K, 1M, 2M; the aggregate bandwidth is not double the single-process rate and reaches the maximum data rate.]

A Slightly Better Model
- For k processes sending messages, the sustained rate is min(R_NIC-NIC, k·R_CORE-NIC)
- Thus T = s + k·n / min(R_NIC-NIC, k·R_CORE-NIC)
- Note that if R_NIC-NIC is very large (a very fast network), this reduces to T = s + k·n/(k·R_CORE-NIC) = s + n/R_CORE-NIC
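
The following is a small C sketch (not from the talk) that evaluates both the classic postal model and the three-parameter model above; the latency and rate values in main() are illustrative assumptions, not measurements.

```c
/* Sketch of the two communication-time models discussed above.
 * Parameter values below are illustrative, not measured. */
#include <stdio.h>

static double min2(double a, double b) { return a < b ? a : b; }

/* Classic "postal" model: T = s + r*n, identical for any number of senders.
 * Here r is the time per byte (the rate is 1/r). */
double t_postal(double s, double r, double n) {
    return s + r * n;
}

/* Three-parameter model: k processes share the NIC-to-NIC link.
 * T = s + k*n / min(R_nic_nic, k*R_core_nic); rates in bytes/second. */
double t_shared_link(double s, double R_nic_nic, double R_core_nic,
                     int k, double n) {
    return s + k * n / min2(R_nic_nic, (double)k * R_core_nic);
}

int main(void) {
    /* Illustrative parameters: 1 us latency, 5 GB/s per-core path,
     * 10 GB/s NIC-to-NIC link, 1 MB messages. */
    double s = 1e-6, Rnn = 10e9, Rcn = 5e9, n = 1 << 20;
    for (int k = 1; k <= 16; k++)
        printf("k=%2d  predicted aggregate rate = %.2f GB/s\n",
               k, k * n / (t_shared_link(s, Rnn, Rcn, k, n) * 1e9));
    return 0;
}
```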

Two Examples
- Two simplified examples: Blue Gene/Q and Cray XE6
[Diagrams: a BG/Q node with multiple paths into the network; two Cray XE6 nodes sharing a single NIC.]
- Note the differences:
  - BG/Q: multiple paths into the network
  - Cray XE6: single path to the NIC (shared by 2 nodes); multiple processes on a node sending can exceed the available bandwidth of the single path

The Test
- Nodecomm discovers the underlying physical topology
- Performs point-to-point communication (ping-pong) using 1 to #cores per node to another node (or another chip if a node has multiple chips)
- Outputs communication time for 1 to #cores along a single channel
- Note that hardware may route some communication along a longer path to avoid contention
- The following results use the code, available soon at https://bitbucket.org/william gropp/baseenv
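
The nodecomm source is not reproduced in the talk; below is a minimal, hypothetical C sketch of the kind of multi-pair ping-pong it performs, assuming an even number of ranks split evenly across two nodes so that rank i on the first node pairs with rank i + size/2 on the second (the real code derives placement from the discovered topology and sweeps message sizes).

```c
/* Minimal multi-pair ping-pong sketch (not the actual nodecomm code). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NITER 100

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size, n = 1 << 20;              /* 1 MB messages */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int half = size / 2;                      /* ranks per node (assumed) */
    int partner = (rank < half) ? rank + half : rank - half;
    char *buf = calloc(n, 1);

    for (int active = 1; active <= half; active++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        if (rank % half < active) {           /* only 'active' pairs talk */
            for (int it = 0; it < NITER; it++) {
                if (rank < half) {
                    MPI_Send(buf, n, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, n, MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else {
                    MPI_Recv(buf, n, MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, n, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
                }
            }
        }
        double t = MPI_Wtime() - t0;
        if (rank == 0)                        /* rank 0 is always active */
            printf("%d pairs: %.3f GB/s aggregate\n", active,
                   2.0 * active * NITER * n / t / 1e9);
        MPI_Barrier(MPI_COMM_WORLD);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```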

How Well Does this Model Work?
- Tested on a wide range of systems:
  - Cray XE6 with Gemini network
  - IBM BG/Q
  - Cluster with InfiniBand
  - Cluster with another network
- Results in "Modeling MPI Communication Performance on SMP Nodes: Is It Time to Retire the Ping Pong Test," W. Gropp, L. Olson, P. Samfass, Proceedings of EuroMPI 2016, https://doi.org/10.1145/2966884.2966919
- Cray XE6 results follow

Cray: Measured Data

Cray: 3 parameter (new) model

Cray: 2 parameter model

Notes
- Both the Cray XE6 and IBM BG/Q have inadequate bandwidth to support each core sending data along the same link
- But BG/Q has more independent links, so it is able to sustain a higher effective halo-exchange rate

Ensuring Application Performance and Scalability
- Defer synchronization and overlap communication and computation (see the sketch below)
  - Needs support for asynchronous progress; avoid busy-wait/polling
- Reduce off-node communication
  - Careful mapping of processes/threads to nodes/cores
- Reduce intranode message copies
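
As one illustration of deferring synchronization, here is a hedged C sketch of a 1-D halo exchange overlapped with computation using nonblocking point-to-point operations; compute_interior() and compute_boundary() are hypothetical placeholders, not routines from the talk.

```c
/* Sketch: overlap a halo exchange with interior computation. */
#include <mpi.h>

/* Placeholder work routines (assumptions for illustration only). */
static void compute_interior(void) { /* updates needing no halo data */ }
static void compute_boundary(double *halo) { (void)halo; /* boundary work */ }

void exchange_and_compute(double *sendbuf, double *recvbuf, int n,
                          int left, int right, MPI_Comm comm)
{
    MPI_Request req[4];

    /* Post receives first, then sends: communication can progress
     * while we work on data that does not depend on it. */
    MPI_Irecv(recvbuf,     n, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(recvbuf + n, n, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(sendbuf,     n, MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(sendbuf + n, n, MPI_DOUBLE, right, 0, comm, &req[3]);

    compute_interior();                 /* work needing no halo data  */

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    compute_boundary(recvbuf);          /* work needing the halo data */
}
```

Note that the overlap only pays off if the MPI implementation makes asynchronous progress, which is exactly the support called for above.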

What To Use as X in MPI + X?
- Threads and tasks: OpenMP, pthreads, TBB, OmpSs, StarPU, ...
- Streams (esp. for accelerators): OpenCL, OpenACC, CUDA, ...
- Alternative distributed memory systems: UPC, CAF, Global Arrays, GASPI/GPI
- MPI shared memory

X = MPI (or X = ϕ)
- MPI 3.1 features especially important for exascale
- Generalized collectives that encourage a post-BSP (Bulk Synchronous Programming) approach:
  - Nonblocking collectives (sketched below)
  - Neighborhood collectives, including nonblocking versions
- Enhanced one-sided communication
  - Precisely specified (see "Remote Memory Access Programming in MPI-3," Hoefler et al., to appear in ACM TOPC)
  - Many more operations, including RMW
- Enhanced thread safety
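
For illustration, a minimal C sketch of the post-BSP style using a nonblocking collective (MPI_Iallreduce) follows; the reduction value and the "independent work" are placeholders.

```c
/* Sketch: a nonblocking collective (MPI-3) lets a global reduction be
 * overlapped with local work instead of acting as a BSP-style barrier. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank + 1.0, global = 0.0;
    MPI_Request req;

    /* Start the reduction, then keep computing while it progresses. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* ... independent local work would go here ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    if (rank == 0) printf("global sum = %g\n", global);

    MPI_Finalize();
    return 0;
}
```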

X = Programming with Threads
- Many choices, with different user targets and performance goals
- Libraries: pthreads, TBB
- Languages: OpenMP 4, C11/C++11
- C11 provides an adequate (and thus complex) memory model for writing portable threaded code (illustrated below)
  - Also needed for MPI-3 shared memory; see "Threads Cannot Be Implemented as a Library," http://www.hpl.hp.com/techreports/2004/HPL-2004-209.html
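
A small illustrative C sketch (not from the talk) of why the memory model matters: one thread publishes data to another through a flag with C11 release/acquire ordering from <stdatomic.h> (pthreads used here for portability, since C11 <threads.h> is optional). Without the ordering guarantees, the reader could see the flag set but stale data.

```c
/* Sketch: publishing data across threads with the C11 memory model. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int payload;                 /* ordinary shared data            */
static atomic_int ready = 0;        /* synchronization flag            */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;                                        /* write data  */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                                /* spin        */
    printf("payload = %d\n", payload);  /* guaranteed to print 42       */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```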

X = UPC (or CAF or ...)
- Processes are UPC programs (not threads), spanning multiple coherence domains
- This model is the closest counterpart to the MPI+OpenMP model, using PGAS to extend the process beyond a single coherence domain
- Could be PGAS across a chip
[Diagram: groups of SMP nodes (memory plus CPUs), each group spanned by a single process / UPC program.]

What are the Issues?
- Isn't the beauty of MPI + X that MPI and X can be learned (by users) and implemented (by developers) independently?
  - Yes (sort of) for users
  - No for developers
- MPI and X must either partition or share resources
  - Users must not blindly oversubscribe
  - Developers must negotiate

More Effort Needed on the "+"
- MPI+X won't be enough for exascale if the work for "+" is not done very well
- Some of this may be language specification:
  - User-provided guidance on resource allocation, e.g., MPI_Info hints; thread-based endpoints
- Some is developer-level standardization
  - A simple example is the MPI ABI specification: users should ignore it but benefit from developers supporting it

Some Resources to Negotiate
- CPU resources: threads and contexts; cores (incl. placement); cache
- Memory resources: prefetch, outstanding loads/stores; pinned pages or equivalent NIC needs; transactional memory regions; memory use (buffers)
- NIC resources: collective groups; routes; power
- OS resources: synchronization hardware; scheduling; virtual memory; cores (dark silicon)

Hybrid Programming with Shared Memory
- MPI-3 allows different processes to allocate shared memory through MPI: MPI_Win_allocate_shared (see the sketch below)
  - Uses many of the concepts of one-sided communication
- Applications can do hybrid programming using MPI or load/store accesses on the shared memory window
- Other MPI functions can be used to synchronize access to shared memory regions
- Can be simpler to program, for both correctness and performance, than threads because of the clearer locality model
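
A minimal C sketch of this pattern, assuming one contiguous shared region per node and using MPI_Win_fence for synchronization (a production code might instead use MPI_Win_lock_all/MPI_Win_sync plus a barrier); the slice size is illustrative.

```c
/* Sketch: MPI-3 shared memory among processes on the same node.
 * Each rank contributes a slice of a node-wide shared window, then reads
 * a neighbor's slice directly with load/store. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Communicator of ranks that can share memory (same node). */
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);
    int nrank, nsize;
    MPI_Comm_rank(nodecomm, &nrank);
    MPI_Comm_size(nodecomm, &nsize);

    /* Each rank contributes 1024 doubles to one shared region. */
    MPI_Win win;
    double *mybase;
    MPI_Win_allocate_shared(1024 * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, nodecomm, &mybase, &win);

    /* Find the base address of the next rank's slice. */
    MPI_Aint segsize;
    int disp_unit;
    double *rightbase;
    int right = (nrank + 1) % nsize;
    MPI_Win_shared_query(win, right, &segsize, &disp_unit, &rightbase);

    /* Write our slice, synchronize, then load from the neighbor's slice. */
    mybase[0] = (double)nrank;
    MPI_Win_fence(0, win);
    printf("rank %d sees neighbor value %g\n", nrank, rightbase[0]);
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
}
```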

A Hybrid Thread-Multiple Ping Pong Benchmark
- In a hybrid thread-multiple approach, what if t threads communicate instead of t processes? (See the sketch below.)
- The benchmark was extended to a multithreaded version where t threads do the ping-pong exchange for a single process per node (i.e., k = 1)
- Results for Blue Waters (Cray XE6); the number of threads t and message sizes n are varied
- The results show:
  - Our performance model no longer applies
  - Performance of the multithreaded version is poor
  - This is due to excessive spin and wait times spent in the MPI library
  - Not an MPI problem, but a problem in the implementation of MPI
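
A hedged C sketch of the thread-multiple variant (not the actual benchmark code), assuming exactly two ranks, one per node, each running the same number of OpenMP threads; the thread id is used as the message tag so exchanges stay paired, and the message size and iteration count are illustrative.

```c
/* Sketch: thread-multiple ping-pong, t OpenMP threads per rank (k = 1). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NITER   100
#define MSGSIZE (1 << 20)

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int partner = 1 - rank;                  /* assumes exactly 2 ranks */

    double t0 = MPI_Wtime();
#pragma omp parallel
    {
        int tag = omp_get_thread_num();      /* one tag per thread pair */
        char *buf = calloc(MSGSIZE, 1);
        for (int it = 0; it < NITER; it++) {
            if (rank == 0) {
                MPI_Send(buf, MSGSIZE, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
                MPI_Recv(buf, MSGSIZE, MPI_CHAR, partner, tag, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, MSGSIZE, MPI_CHAR, partner, tag, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSGSIZE, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
            }
        }
        free(buf);
    }
    double t = MPI_Wtime() - t0;
    if (rank == 0) {
        int nthreads = omp_get_max_threads();
        printf("%d threads: %.3f GB/s aggregate\n", nthreads,
               2.0 * nthreads * NITER * (double)MSGSIZE / t / 1e9);
    }
    MPI_Finalize();
    return 0;
}
```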

Results for Multithreaded Ping Pong Benchmark: Coarse-Grained Locking
[Plots: measurements for the single-threaded benchmark vs. the multithreaded benchmark.]

Results for Multithreaded Ping Pong Benchmark: Fine-Grained Locking
[Plots: measurements for the single-threaded benchmark vs. the multithreaded benchmark.]

Implications For Hybrid Programming
- The model and measurements on Blue Waters suggest that if a fixed amount of data needs to be transferred from one node to another, the hybrid master-only style will have a disadvantage compared to pure MPI
  - The disadvantage might not be visible for very large messages, where a single thread (calling MPI in the master-only style) might be able to saturate the NIC
- In addition, a thread-multiple hybrid approach seems currently infeasible because of a severe performance decline in current MPI implementations
  - Again, not a fundamental problem in MPI; rather, an example of the difficulty of achieving high performance with general threads

Lessons Learned
- Achieving good performance with hybrid parallelism requires careful management of concurrency and locality
- The fine-grain approach has potential but suffers in practice; the coarse-grain approach requires more programmer effort but gives better performance
- MPI+MPI and MPI+OpenMP are both practical
- Concurrent processing of non-contiguous data is also important (gives an advantage to multiple MPI processes; competes with load balancing)
- Problem decomposition and (hybrid) parallel communication performance are interdependent; a holistic approach is therefore essential

Summary
- Multi- and many-core nodes require a new communication performance model
  - Implies a different approach to algorithms and increased emphasis on support for asynchronous progress
- Intranode communication with shared memory can improve performance, but:
  - Locality remains critical
  - Fast memory synchronization and signaling are essential
  - Most (all?) current MPI implementations have a very slow intranode MPI_Barrier

Thanks!
- Philipp Samfass, Luke Olson
- Pavan Balaji, Rajeev Thakur, Torsten Hoefler
- ExxonMobil Upstream Research
- Blue Waters Sustained Petascale Project, supported by the National Science Foundation (award number OCI 07-25070) and the state of Illinois
- Cisco Systems for access to the Arcetri UCS Balanced Technical Computing Cluster