
MPI in 2020: Opportunities and Challenges
William Gropp
www.cs.illinois.edu/~wgropp

MPI and Supercomputing
- The Message Passing Interface (MPI) has been amazingly successful.
- First released in 1992, it is still the dominant programming system used to program the world's fastest computers.
- The most recent version, MPI 3.1, released in June 2015, contains many features to support systems with >100K processes and state-of-the-art networks.
- Supercomputing (and computing) is reaching a critical point: the end of Dennard scaling has forced major changes in processor architecture.
- This talk looks at the future of MPI from the point of view of extreme-scale systems; that technology will also be used in single-rack systems.

Likely Exascale Architectures
[Figure 2.1: Abstract Machine Model of an Exascale Node Architecture: fat cores plus thin cores/accelerators, 3D stacked memory (low capacity, high bandwidth), DRAM/NVRAM (high capacity, low bandwidth), an integrated NIC for off-chip communication, and multiple coherence domains; the node is not fully cache coherent.]
- From "Abstract Machine Models and Proxy Architectures for Exascale Computing," Rev 1.1, J. Ang et al.
- Note that I/O is not part of this model (it may hang off the NIC).

What This (Might) Mean for MPI
- Lots of innovation in the processor and the node:
  - More complex memory hierarchy; no chip-wide cache coherence
  - Tightly integrated NIC
- Execution model becoming more complex:
  - Achieving performance and reliability targets requires exploiting new features
- Node programming changing:
  - OpenMP/OpenACC/CUDA; shared-memory features in C11/C++11

What This (Might) Mean for Applications
- Weak scaling limits the range of problems.
- Latency may be critical (also, some applications are nearing the limits of spatial parallelism).
- A rich execution model makes performance portability unrealistic; applications will need to be flexible in both their use of abstractions and their implementation of those abstractions.
- One answer: programmers will need help with performance issues, whatever parallel programming system is used. Much of this is independent of the internode parallelism and can use DSLs, annotations, and source-to-source transformations.

Where Is MPI Today?
Applications already running at large scale:

  System         Cores
  Tianhe-2       3,120,000 (most in Phi)
  Sequoia        1,572,864
  Blue Waters    792,064* + 59,136 SMX
  Mira           786,432
  K computer     705,024
  Julich BG/Q    458,752
  Titan          299,008* + 261,632 SMX

  * 2 cores share a wide FP unit

MPI+X
- Many reasons to consider MPI+X. Major ones:
  - We always have it: MPI+C, MPI+Fortran. Both C11 and Fortran include support for parallelism (shared memory in C, distributed memory in Fortran).
  - Abstract execution models are becoming more complex, and experience has shown that the programmer must be given some access to performance features.
- Options are (a) add support to MPI and (b) let X support some aspects (a minimal MPI+threads sketch follows below).
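
As a concrete illustration of the "+" in MPI+X, the sketch below (my own minimal example, not from the talk; it assumes an MPI implementation that provides MPI_THREAD_FUNNELED and an OpenMP compiler) shows the usual hybrid handshake: the application tells MPI which thread level it needs and checks what the implementation actually provides before spawning threads.

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        double local = 0.0, global;

        /* Ask for the thread level we need; MPI reports what it provides. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        if (provided < MPI_THREAD_FUNNELED)
            MPI_Abort(MPI_COMM_WORLD, 1);

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* X = OpenMP inside the node; only the main thread calls MPI,
           which is what MPI_THREAD_FUNNELED promises. */
    #pragma omp parallel reduction(+:local)
        local += omp_get_thread_num() + 1;   /* stand-in for node-local work */

        /* MPI between processes. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0)
            printf("global sum = %g\n", global);

        MPI_Finalize();
        return 0;
    }

MPI_THREAD_FUNNELED is the common choice when only the main thread makes MPI calls; MPI_THREAD_MULTIPLE is needed if all threads communicate.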

Many Possible Values of X
- X = MPI (or X = ϕ): MPI-3 has many features especially important for extreme scale, such as nonblocking collectives and neighbor collectives (a nonblocking-collective sketch follows this list). MPI-4 is looking at additional features (e.g., RMA with notify; come to the MPI BoF today!).
- X = threads (OpenMP/pthreads/C11): C11 provides an adequate (and thus complex) memory model for writing portable threaded code.
- X = CAF or UPC or other (A)PGAS: think of these as an extension of a thread model.
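
For example, the MPI-3 nonblocking collectives mentioned above let a program start a reduction, overlap it with independent work, and complete it later. A minimal sketch (my illustration, not part of the slides):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double local, global = 0.0;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        local = rank + 1.0;

        /* Start the reduction but do not wait for it yet. */
        MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        /* ... computation that does not depend on 'global' or touch 'local' ... */

        /* Complete the collective only when the result is needed. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        if (rank == 0)
            printf("sum = %g\n", global);

        MPI_Finalize();
        return 0;
    }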

What Are the Issues?
- Isn't the beauty of MPI+X that MPI and X can be learned (by users) and implemented (by developers) independently?
  - Yes (sort of) for users; no for developers.
- MPI and X must either partition or share resources:
  - Users must not blindly oversubscribe (a sketch of one mitigation follows below).
  - Developers must negotiate.
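
One user-level way to avoid blind oversubscription (a sketch under my own assumptions, not a recipe from the talk; cores_per_node is a hypothetical parameter the application must supply or detect) is to size the thread pool from what MPI reports about the node:

    #include <mpi.h>
    #include <omp.h>

    /* Give each MPI rank on a node an equal share of the node's cores as
       OpenMP threads, instead of blindly oversubscribing.
       'cores_per_node' is a hypothetical input the application supplies. */
    static void set_threads_for_node(int cores_per_node)
    {
        MPI_Comm node_comm;
        int ranks_on_node, threads;

        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        MPI_Comm_size(node_comm, &ranks_on_node);

        threads = cores_per_node / ranks_on_node;
        if (threads < 1) threads = 1;   /* already oversubscribed with ranks */
        omp_set_num_threads(threads);

        MPI_Comm_free(&node_comm);
    }

The deeper negotiation, such as who owns network endpoints or progress threads, is exactly the developer-level problem this slide refers to.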

More Effort Needed on the "+"
- MPI+X won't be enough for exascale if the work for "+" is not done very well.
- Some of this may be language specification: user-provided guidance on resource allocation, e.g., MPI_Info hints (example below); thread-based endpoints.
- Some is developer-level standardization. A simple example is an MPI ABI specification: users should be able to ignore it but benefit from developers supporting it.
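
MPI_Info is the existing vehicle for this kind of user-provided guidance. A small sketch (my example; "access_style" and "collective_buffering" are reserved I/O hint keys in the MPI standard, and an implementation is free to ignore any hint it does not understand):

    #include <mpi.h>

    /* Open a file with MPI_Info hints; keys an implementation does not
       understand are simply ignored, so the code stays portable. */
    void open_with_hints(const char *path, MPI_File *fh)
    {
        MPI_Info info;

        MPI_Info_create(&info);
        MPI_Info_set(info, "access_style", "write_once,sequential");
        MPI_Info_set(info, "collective_buffering", "true");

        MPI_File_open(MPI_COMM_WORLD, path,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);

        MPI_Info_free(&info);
    }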

Which MPI?
- Many new features in MPI-3, yet many programs still use subsets of MPI-1.
- MPI implementations are still improving; a long process harmed by nonstandard shortcuts.
- The MPI Forum is active and considering new features relevant for exascale; MPI 3.1 was released in June 2015. See the MPI BoF today for more info!

Fault Tolerance
- Often raised as a major issue for exascale systems.
- Experience has shown systems to be more reliable than simple extrapolations assumed. Hardly surprising: reliability is costly, so systems are engineered only to the reliability needed.
- Major question: what is the fault model?
  - Process failure (why is this the common model?)
    - Software: then the program is buggy, and recovery may not make sense.
    - Hardware: where (CPU/memory/NIC/cables)? Recovery may be easy or impossible.
  - Silent data corruption (SDC)
- Most effort in the MPI Forum is on process fail-stop faults.

Separate Coherence Domains and Address Spaces
- There are already many systems without cache coherence and with separate address spaces:
  - GPUs are the best example; this is unlikely to change even when they are integrated on chip.
  - OpenACC is an X that supports this.
- MPI was designed for this case:
  - Despite common practice, the MPI definition of MPI_Get_address supports, for example, segmented address spaces; MPI_Aint_add etc. provide portable address arithmetic (see the sketch below).
  - The MPI RMA separate memory model also fits this case; the separate model was defined in MPI-2 to support the world's fastest machines of the time, including the NEC SX series and the Earth Simulator.
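
As an illustration of that portable address arithmetic (a sketch of mine, assuming MPI-3.1 for MPI_Aint_diff, the companion of MPI_Aint_add), displacements are computed with MPI_Get_address rather than by subtracting raw pointers, which is not well defined on segmented address spaces:

    #include <mpi.h>

    struct particle { double x[3]; double v[3]; int id; };

    /* Compute displacements of struct members with MPI_Get_address and
       MPI_Aint_diff; valid even where pointer subtraction is not. */
    void make_offsets(const struct particle *p, MPI_Aint *disp_x, MPI_Aint *disp_id)
    {
        MPI_Aint base, addr;

        MPI_Get_address(p, &base);

        MPI_Get_address(&p->x[0], &addr);
        *disp_x = MPI_Aint_diff(addr, base);    /* offset of x within *p */

        MPI_Get_address(&p->id, &addr);
        *disp_id = MPI_Aint_diff(addr, base);   /* offset of id within *p */
    }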

Towards MPI-4
Many extensions are being considered, either by the Forum or as research, including:
- Other communication paradigms:
  - Active messages ("Toward Asynchronous and MPI-Interoperable Active Messages," Zhao et al., CCGrid '13)
  - Streams
- Tighter integration with threads:
  - Endpoints
- Data centric:
  - More flexible datatypes; faster datatype implementations (see, e.g., Prabhu & Gropp, EuroMPI '15; a datatype sketch follows this list)
  - Better parallel file systems (matching the MPI I/O semantics)
- Unified address space handling:
  - E.g., GPU memory to GPU memory without CPU processing
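
As a small datatype example (my sketch, not from the cited papers): a strided column of a row-major matrix can be described once with a derived datatype and sent without user-level packing; much of the datatype research above is about making such paths as fast as hand-written packing.

    #include <mpi.h>

    /* Send column 'col' of an nrows x ncols row-major matrix of doubles
       as one message; the MPI implementation does the packing. */
    void send_column(const double *a, int nrows, int ncols, int col,
                     int dest, MPI_Comm comm)
    {
        MPI_Datatype column;

        MPI_Type_vector(nrows, 1, ncols, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        MPI_Send(&a[col], 1, column, dest, 0, comm);

        MPI_Type_free(&column);
    }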

MPI Is Not a BSP System
- BSP = Bulk Synchronous Programming.
  - Programmers like the BSP model, adopting it even when not necessary (see "functionally irrelevant barriers").
  - Unlike most programming models, BSP was designed with a performance model, to encourage quantitative design of programs.
- MPI makes it easy to emulate a BSP system: rich set of collectives, barriers, blocking operations.
- But MPI (even MPI-1) is sufficient for dynamic, adaptive programming (a sketch follows below). The main issues are performance and progress; improving implementations and better hardware support for integrated CPU/NIC coordination is the right answer.
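
A sketch of the dynamic, non-BSP style that plain MPI-1 point-to-point already supports (my illustration; the matching sends are assumed to be posted by the partner ranks): messages are processed in whatever order they arrive instead of waiting at a global synchronization point.

    #include <stdlib.h>
    #include <mpi.h>

    /* Receive one block from each partner and process blocks in whatever
       order they arrive; the matching sends are posted by the partner ranks. */
    void exchange_and_process(double *recvbuf, int count,
                              const int *partners, int npartners,
                              MPI_Comm comm)
    {
        MPI_Request *reqs = malloc(npartners * sizeof(MPI_Request));

        for (int i = 0; i < npartners; i++)
            MPI_Irecv(&recvbuf[i * count], count, MPI_DOUBLE,
                      partners[i], 0, comm, &reqs[i]);

        for (int done = 0; done < npartners; done++) {
            int idx;
            MPI_Waitany(npartners, reqs, &idx, MPI_STATUS_IGNORE);
            /* ... work on the block from partners[idx] as soon as it arrives ... */
        }

        free(reqs);
    }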

Some Remaining Issues
- Latency and overheads:
  - Libraries add overheads.
  - Several groups are working on applying compiler techniques to MPI and on using annotations to transform the user's code; this can address some issues.
- Execution model mismatch:
  - How do we make it easy for the programmer to express operations in a way that makes it easy to exploit innovative hardware or runtime features?
  - Especially important for exascale, as innovation is essential in meeting the 20 MW, MTBF, total memory, etc., targets.

What Are the Real Problems?
- Support for application-specific, distributed data structures:
  - Not an MPI problem, and very hard to solve in general.
  - A data-structure-specific language (often called a domain-specific language) is a better solution.
- A practical execution model with a performance model.
- Greater attention to latency, which directly relates to programmability.

MPI in 2020
- Alive and well, using C11, C++11, Fortran 2008 (or later); node programming uses locality-aware, autotuning programming systems.
- More use of RMA features (see the sketch below). This depends on better MPI implementations and continued co-evolution of MPI and RMA hardware to add new features (notification?).
- (Partial?) solution of the "+" problem: at least an ad hoc implementers' standard for sharing the most critical resources.
- Some support for fault tolerance: probably not at the level needed for reliable systems, but OK for simulations.
- Better I/O support, including higher-level libraries, but only if the underlying system implements something better than POSIX I/O.
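
A minimal passive-target RMA sketch of the kind this prediction refers to (my example, using the MPI-3 one-sided interface; not from the slides):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, target;
        double local, remote = -1.0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each process exposes one double through an RMA window. */
        local = rank;
        MPI_Win_create(&local, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* Passive target: no matching call is needed on the target side. */
        MPI_Win_lock_all(0, win);

        target = (rank + 1) % size;
        MPI_Get(&remote, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
        MPI_Win_flush(target, win);           /* 'remote' is now usable */

        printf("rank %d read %g from rank %d\n", rank, remote, target);

        MPI_Win_unlock_all(win);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }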