Overview of research activities: toward portability of performance


Overview of research activities: toward portability of performance
- Do dynamically what can't be done statically
- Understand the evolution of architectures
- Enable new programming models
- Put intelligence into the runtime!
- Exploiting shared-memory machines: thread scheduling over hierarchical multicore architectures (OpenMP), task scheduling over accelerator-based machines
- Communication over high-speed networks: multicore-aware communication engines, multithreaded MPI implementations
- Integration of multithreading and communication: runtime support for hybrid programming (MPI + OpenMP + CUDA + TBB + ...)
[Diagram: HPC applications, parallel compilers and parallel libraries sit on top of the runtime system and the operating system, which drive CPUs and GPUs]

Heterogeneous computing is here, and portable programming is getting harder
- GPUs are the new kids on the block: very powerful data-parallel accelerators, with a specific instruction set and no hardware memory consistency
- Clusters featuring accelerators are already heading the Top500 list: Tianhe-1A (#1), Nebulae (#3), Tsubame 2.0 (#5), Roadrunner (#?)
- Using GPUs as side accelerators is not enough: GPUs must become first-class citizens

Heterogeneous computing is here: how shall we program heterogeneous clusters?
The hard, hybrid way:
- Combine different paradigms by hand: MPI + {OpenMP/TBB/???} + {CUDA/OpenCL}
- Portability is hard to achieve: work distribution depends on the number of GPUs and CPUs per node; tools such as S-GPU may help!
- Needs aggressive autotuning
- Currently used for building parallel numerical kernels: MAGMA, D-PLASMA, FFT kernels
[Diagram: multicore CPUs programmed with OpenMP/TBB?/Cilk?/MPI, accelerators programmed with ALF/CUDA/OpenCL/Ct, each processing unit with its own memory]

Heterogeneous computing is here: mixing different paradigms leads to several issues
- Semantics issues: MPI and OpenMP don't mix easily (e.g. MPI communication inside parallel regions); higher-level abstractions would help! Think about domain-decomposition algorithms
- Resource allocation issues: can we really use several hybrid parallel kernels simultaneously? Ever tried to mix OpenMP and MKL? Doing so could be helpful in order to exploit millions of cores
- It's all about composability: probably the biggest challenge for runtime systems; hybridization will mostly be indirect (linking libraries), and with composability come a lot of related issues, such as the need for autotuning / scheduling hints

Runtime systems enabling composability
- Background: so far, we've been working on providing a common runtime system for MPI + (OpenMP)*, i.e. multiple OpenMP kernels mixed inside an MPI application
- Main features: hierarchical thread scheduling (with potential oversubscription); topology-aware, adaptive parallelism (give more cores to regions that scale better!)
- Towards a common, unified runtime system?
[Diagram: MKL, PLASMA, OpenMP, Intel TBB and HMPP on top of a unified multicore runtime system providing task management (threads/tasklets/codelets), topology-aware scheduling, memory management, synchronization, data distribution facilities, MPI implementations and I/O services]

Heterogeneous computing is here (cont'd): how shall we program heterogeneous clusters?
The uniform way:
- Use a single high-level programming language (or a combination of them) to deal with network + multicore + accelerators
- Increasing number of directive-based languages: use simple directives and good compilers! XcalableMP (PGAS approach); HMPP, OpenMPC, OpenMP 4.0 (generate CUDA from OpenMP code); StarSs
- Much better potential for composability, if the compiler is clever!
[Diagram: multicore CPUs programmed with OpenMP, accelerators programmed with HMPP/XMP/OpenMPC?/StarSs]
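As a small illustration of the directive-based style (not taken from the slides; a minimal sketch using the OpenMP target construct of OpenMP 4.x), a couple of pragmas are enough for the compiler to generate the device kernel and the host/device transfers:

#include <stddef.h>

/* Offload the scaling loop to an accelerator if one is available;
 * the map clause tells the compiler which data must be copied to
 * and from the device. */
void scale_vector(float *v, float a, size_t n)
{
    #pragma omp target teams distribute parallel for map(tofrom: v[0:n])
    for (size_t i = 0; i < n; i++)
        v[i] *= a;
}

Whether this performs well is exactly the composability question raised above: the compiler has to be clever about data movement and about sharing the machine with other kernels.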

We need new runtime systems!
- Leveraging CUDA/OpenCL
- Kernels need to exploit GPUs AND CPUs simultaneously
- Kernels need to run simultaneously
- Kernels need to accommodate a variable number of processing units

Overview of StarPU: a runtime system for heterogeneous architectures
Rationale:
- Dynamically schedule tasks on all processing units: see a pool of heterogeneous processing units
- Avoid unnecessary data transfers between accelerators: software VSM (virtual shared memory) for heterogeneous machines
[Diagram: A = A+B computed across CPUs and GPUs, with copies of A and B replicated in the different memories]

Overview of StarPU: maximizing PU occupancy, minimizing data transfers
Ideas:
- Accept tasks that may have multiple implementations, together with potential inter-dependencies: this leads to a dynamically built, directed acyclic graph of tasks
- Provide a high-level data management layer: the application only describes which data may be accessed by tasks and how data may be divided
[Diagram: applications, parallel compilers and parallel libraries on top of StarPU, which drives CPUs and GPUs through its drivers (CUDA, OpenCL)]
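To make this concrete, here is a minimal sketch of task submission with the StarPU C API; the function and field names follow recent StarPU releases rather than the version shown in these slides, and the vector size and scaling kernel are illustrative:

#include <starpu.h>

/* CPU implementation of a "scale vector" task. */
static void scal_cpu(void *buffers[], void *cl_arg)
{
    float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    float factor;
    starpu_codelet_unpack_args(cl_arg, &factor);
    for (unsigned i = 0; i < n; i++)
        val[i] *= factor;
}

/* The codelet describes the available implementations and the data access mode. */
static struct starpu_codelet scal_cl = {
    .cpu_funcs = { scal_cpu },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(void)
{
    float vec[1024];
    float factor = 3.14f;
    starpu_data_handle_t handle;

    starpu_init(NULL);
    for (unsigned i = 0; i < 1024; i++)
        vec[i] = 1.0f;

    /* Register the vector with StarPU's data management layer (VSM). */
    starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                (uintptr_t)vec, 1024, sizeof(vec[0]));

    /* Submit an asynchronous task; StarPU picks a worker and handles transfers. */
    starpu_task_insert(&scal_cl,
                       STARPU_RW, handle,
                       STARPU_VALUE, &factor, sizeof(factor),
                       0);

    starpu_task_wait_for_all();
    starpu_data_unregister(handle);
    starpu_shutdown();
    return 0;
}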

Memory management: automating data transfers
- StarPU provides a Virtual Shared Memory (VSM) subsystem: weak consistency, explicit data fetch, replication, MSI protocol, single writer (except for specific accumulation data)
- High-level API: partitioning filters
- Input & output of tasks = references to VSM data
[Diagram: applications, parallel compilers and parallel libraries on top of StarPU and its drivers (CUDA, OpenCL), driving CPUs and GPUs]
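Continuing the sketch above (reusing its handle, codelet and scaling factor), a hedged illustration of partitioning filters; the block count is arbitrary:

/* Split the registered vector into 4 equal blocks; each sub-handle can
 * then be passed to a different task, possibly on a different device. */
struct starpu_data_filter f = {
    .filter_func = starpu_vector_filter_block,
    .nchildren   = 4,
};
starpu_data_partition(handle, &f);

for (unsigned i = 0; i < 4; i++) {
    starpu_data_handle_t sub = starpu_data_get_sub_data(handle, 1, i);
    starpu_task_insert(&scal_cl, STARPU_RW, sub,
                       STARPU_VALUE, &factor, sizeof(factor), 0);
}
starpu_task_wait_for_all();

/* Gather the pieces back into the original handle (in main memory). */
starpu_data_unpartition(handle, STARPU_MAIN_RAM);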

Task scheduling: dealing with heterogeneous hardware accelerators
- Tasks = data input & output, dependencies with other tasks, multiple implementations (e.g. CUDA + CPU implementations), scheduling hints
- StarPU provides an open scheduling platform: scheduling algorithms are plug-ins
[Diagram: a task f with cpu/gpu/spu implementations and arguments accessed as (A: RW, B: R, C: R), submitted by applications, parallel compilers or parallel libraries to StarPU and its drivers (CUDA, OpenCL)]
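A hedged sketch of such a multi-implementation codelet, extending the earlier one (the CUDA kernel, grid size and flags are illustrative; field names follow recent StarPU releases, and the CUDA part would live in a file compiled by nvcc, with both files including starpu.h):

/* CUDA side: the same "scale vector" operation as a GPU kernel. */
__global__ void scal_kernel(float *val, float factor, unsigned n)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        val[i] *= factor;
}

extern "C" void scal_cuda(void *buffers[], void *cl_arg)
{
    float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    float factor;
    starpu_codelet_unpack_args(cl_arg, &factor);

    /* Launch on the CUDA stream StarPU associates with this worker. */
    scal_kernel<<<(n + 255) / 256, 256, 0,
                  starpu_cuda_get_local_stream()>>>(val, factor, n);
}

/* Host side: one codelet, several implementations; the scheduler decides
 * at runtime on which kind of worker each task instance executes. */
static struct starpu_codelet scal_cl_multi = {
    .cpu_funcs  = { scal_cpu },
    .cuda_funcs = { scal_cuda },
    .cuda_flags = { STARPU_CUDA_ASYNC },
    .nbuffers   = 1,
    .modes      = { STARPU_RW },
};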

Task scheduling: how does it work?
- When a task is submitted, it first goes into a pool of "frozen" tasks until all its dependencies are met
- Then, the task is pushed to the scheduler
- Idle processing units actively poll for work ("pop")
- What happens inside the scheduler is up to you!
[Diagram: tasks pushed into the scheduler and popped by CPU workers and GPU workers]

Task scheduling: developing your own scheduler
- Queue-based scheduler: each worker "pops" tasks from a specific queue
- Implementing a strategy is easy! Select the queue topology, then implement "pop" and "push" (see the sketch below)
- Priority tasks, work stealing, performance models, ...
- A testbed for scheduling algorithms
[Diagram: tasks pushed into per-worker queues and popped by CPU workers and GPU workers]
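Built-in policies can already be swapped without touching the application, e.g. STARPU_SCHED=dmda ./my_app selects the performance-model-based, HEFT-like policy. Writing a new policy essentially means providing the "push" and "pop" callbacks; the following is only a conceptual sketch of that contract in plain C (a single shared FIFO protected by a mutex), not the actual StarPU plug-in interface:

#include <pthread.h>
#include <stddef.h>

/* Stand-in for whatever the runtime hands to the policy. */
struct task { struct task *next; };

static struct task *fifo_head, *fifo_tail;
static pthread_mutex_t fifo_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called by the runtime when a task becomes ready ("push"). */
static void push_task(struct task *t)
{
    pthread_mutex_lock(&fifo_lock);
    t->next = NULL;
    if (fifo_tail)
        fifo_tail->next = t;
    else
        fifo_head = t;
    fifo_tail = t;
    pthread_mutex_unlock(&fifo_lock);
}

/* Called by an idle worker (CPU or GPU driver) polling for work ("pop"). */
static struct task *pop_task(void)
{
    pthread_mutex_lock(&fifo_lock);
    struct task *t = fifo_head;
    if (t) {
        fifo_head = t->next;
        if (!fifo_head)
            fifo_tail = NULL;
    }
    pthread_mutex_unlock(&fifo_lock);
    return t;
}

Priorities, per-worker queues or work stealing only change the queue topology hidden behind these two entry points.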


Dealing with heterogeneous architectures: performance prediction
- Task completion time estimation: history-based, user-defined cost function, or parametric cost model
- Can be used to improve scheduling, e.g. Heterogeneous Earliest Finish Time (HEFT)
[Diagram: Gantt chart of tasks scheduled over cpu #1-3 and gpu #1-2 along the time axis]
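A hedged sketch of how a history-based model is attached to a codelet in the StarPU C API (the symbol name is arbitrary); after a few calibration runs, the recorded timings per implementation and per data size feed the scheduler:

/* Timings are recorded per (symbol, input size, worker architecture) and
 * persisted across runs, so schedulers such as the HEFT-like policies can
 * predict where a task will finish earliest. */
static struct starpu_perfmodel scal_model = {
    .type   = STARPU_HISTORY_BASED,
    .symbol = "scal_vector",
};

static struct starpu_codelet scal_cl_modeled = {
    .cpu_funcs  = { scal_cpu },
    .cuda_funcs = { scal_cuda },
    .nbuffers   = 1,
    .modes      = { STARPU_RW },
    .model      = &scal_model,
};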

Dealing with heterogeneous architectures: performance prediction
- Data transfer time estimation: sampling based on off-line calibration
- Can be used to better estimate the overall execution time and to minimize data movements
[Diagram: Gantt chart of tasks scheduled over cpu #1-3 and gpu #1-2 along the time axis]

Dealing with heterogeneous architectures: performance
On the influence of the scheduling policy:
- LU decomposition, 8 CPUs (Nehalem) + 3 GPUs (FX5800)
- 80% of the work goes to the GPUs, 20% to the CPUs
- StarPU exhibits good scalability with respect to problem size and number of GPUs

Dealing with heterogeneous architectures: implementing MAGMA on top of StarPU
- With the University of Tennessee & INRIA HiePACS
- Cholesky decomposition, 5 CPUs (Nehalem) + 3 GPUs (FX5800)
- Efficiency > 100%
[Figure: performance (Gflop/s, up to ~1000) vs. matrix order (5120 to 46080) for 1 GPU, 2 GPUs, 3 GPUs, and 3 GPUs + 5 CPUs; a 4 GB mark appears on the matrix-order axis]

Using StarPU through a standard API: a StarPU driver for OpenCL
- Run legacy OpenCL codes on top of StarPU: OpenCL sees a number of "starpu" devices
- Performance limitations: data transfers are performed just-in-time, and data replication is not managed by StarPU
- Ongoing work: we propose light extensions to OpenCL; they greatly improve flexibility when used, and regular OpenCL behavior is preserved if no extension is used
[Diagram: a legacy OpenCL application using the OpenCL API on top of StarPU, which drives CPUs and GPUs]

Integration with multithreading: dealing with parallel StarPU tasks
- StarPU + OpenMP/TBB/...
- Many algorithms can take advantage of shared memory: we can't seriously "taskify" the world!
- The stencil case: when neighbor tasks can be scheduled on a single node, just use shared memory! Hence an OpenMP stencil kernel

Integration with multithreading (cont'd): dealing with parallel StarPU tasks
- Current approach: let StarPU spawn OpenMP tasks (sketched below); performance modeling would still be valid, and it would also work with other tools, e.g. Intel TBB
- How to find the appropriate granularity? It may depend on the concurrent tasks!
- StarPU tasks = first-class citizens: need to bridge the gap with existing parallel languages
[Diagram: CPU workers and GPU workers]
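A hedged sketch of the idea (the stencil update and buffer layout are illustrative): the CPU implementation of a StarPU task simply opens an OpenMP parallel region, so the usual codelet mechanism drives a multi-threaded kernel. In practice the workers must be arranged (e.g. by reserving cores for OpenMP or using StarPU's parallel/combined workers) so that the two runtimes do not oversubscribe the machine.

#include <starpu.h>

/* CPU implementation of a parallel StarPU task: a 3-point, Jacobi-style
 * stencil whose loop is spread over an OpenMP team. buffers[0] is read,
 * buffers[1] is written (modes STARPU_R and STARPU_W in the codelet). */
static void stencil_cpu(void *buffers[], void *cl_arg)
{
    (void)cl_arg;
    const float *in = (const float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    float *out      = (float *)STARPU_VECTOR_GET_PTR(buffers[1]);
    unsigned n      = STARPU_VECTOR_GET_NX(buffers[0]);

    if (n < 2)
        return;

    #pragma omp parallel for
    for (unsigned i = 1; i < n - 1; i++)
        out[i] = 0.5f * in[i] + 0.25f * (in[i - 1] + in[i + 1]);
}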

High-level integration: generating StarPU code out of StarSs
- Experiments with StarSs [UPC Barcelona]
- Writing StarSs + OpenMP code is easy
- A platform for experimenting with hybrid OpenMP + StarPU scheduling

Example from the slide (the snippet is truncated there; the end of main is restored below):

#pragma css task inout(v)
void scale_vector(float *v, float a, size_t n);

#pragma css target device(smp) implements(scale_vector)
void scale_vector_cpu(float *v, float a, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        v[i] *= a;
}

int main(void)
{
    float v[] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    size_t vs = sizeof(v) / sizeof(*v);

    #pragma css start
    scale_vector(v, 4, vs);
    /* The slide stops here; the closing pragma and return are assumed. */
    #pragma css finish
    return 0;
}

Future work
- Propose natural extensions to OpenCL; introduce more dynamicity
- Enhance cooperation between runtime systems and compilers: granularity, runtime support for divisible tasks, feedback for autotuning software [PEPPHER European project]
- Demonstrate the relevance of StarPU in other frameworks: StarPU + OpenMP + MPI as a target for XcalableMP (French-Japanese ANR-JST FP3C project)

Thank you! More information about StarPU http://runtime.bordeaux.inria.fr