
Introduction to Runtime Systems: Towards Portability of Performance
Team STORM (Static Optimizations, Runtime Methods)
Olivier Aumage, Inria / LaBRI, in cooperation with La Maison de la Simulation

Contents
1. Introduction
2. Computing Hardware
3. Parallel Programming Models
4. Computing Runtime Systems

1. Introduction

Hardware Evolution: More Capabilities, More Complexity

Graphics: higher resolutions, 2D acceleration, 3D rendering.
Networking: processing offload, zero-copy transfers, hardware multiplexing.
I/O: RAID, SSDs vs. disks, network-attached disks, parallel file systems.
Computing: multiprocessors, multicores, vector processing extensions, accelerators.

Dilemma for the Application Programmer

Stay conservative? Only use standards and long-established features: sequential programming, common Unix system calls, TCP sockets. The risk: under-used hardware and low performance.

Or use tempting, bleeding-edge features? The gains: efficiency and convenience. The open questions: portability? adaptiveness? cost? long-term viability? vendor-tied code?

The way out: use runtime systems!

The Role(s) of Runtime Systems

Portability: abstraction; drivers, plugins.
Control: resource mapping, scheduling.
Adaptiveness: load balancing; monitoring, sampling, calibrating.
Optimization: request aggregation, resource locality, computation offload, computation/transfer overlap.

Examples of Runtime Systems

Networking: MPI (Message Passing Interface), Global Arrays, CCI (Common Communication Interface), distributed shared memory systems.
Graphics: DirectX, Direct3D (Microsoft Windows), OpenGL.
I/O: MPI-IO, database engines (Google LevelDB).
Computing runtime systems?...

2. Computing Hardware

Evolution of Computing Hardware

A rupture: the frequency wall. Processing units cannot run any faster, so other sources of performance must be found. Hardware parallelism multiplies the existing processing power by having several processing units work together. Not a new idea... but now becoming the key performance factor.

Processor Parallelism

Various forms of hardware parallelism: multiprocessors, multicores, hardware multithreading (SMT), vector processing (SIMD). Multiple forms may be combined.

Multiprocessors and Multicores

Multiprocessors: full processor replicates. Rationale: share the node contents, i.e. memory and devices. Memory sharing may involve non-uniformity. See the upcoming hwloc and TreeMatch talks!

Multicores: processor circuit replicates (cores) printed on the same die. Rationale: use the die area freed by the shrinking process for more processing power; share memory and devices; cores may also share some additional die circuitry (caches, uncore services). See the upcoming hwloc and TreeMatch talks!

Multiprocessors and Multicores: Taking Advantage of Them?

Requires multiple parallel application activities.
Additional considerations: availability, work mapping, locality, memory bandwidth.

Hardware Multithreading

Simultaneous Multithreading (SMT): multiple processing contexts managed by the same core, enabling the interleaving of multiple threads on that core. Rationale: try to fill more computing units (e.g. int + float units) and hide memory/cache latency.

Taking advantage of it? Requires multiple parallel application activities, and is highly dependent on the characteristics of those activities (complementary vs. competitive).

Additional considerations: availability, work mapping, locality, memory bandwidth, benefit vs. loss.

Vector Processing

Single Instruction, Multiple Data (SIMD): apply one instruction to multiple data elements simultaneously, enabling simple operations to be repeated over array elements. Rationale: share instruction decoding among several data elements.

Taking advantage of it? Specially written kernels: generated by the compiler, written in assembly language, or written with intrinsics.

Additional considerations: availability; feature sets and variants (MMX, 3DNow!, SSE [2...5], AVX, ...); benefit vs. loss.

Accelerators

Special-purpose computing devices (or general-purpose GPUs); initially a discrete expansion card. Rationale: a die area trade-off.

Single Instruction, Multiple Threads (SIMT): a single control unit... for several computing units (the scalar cores, or "streaming processors", of a GPU streaming multiprocessor).

SIMT is distinct from SIMD: control flow is allowed to diverge (e.g. between the branches of an if/else), but divergence is better avoided.

[Figure: a GPU as a set of streaming multiprocessors, each with one control unit driving many scalar cores, attached to device DRAM.]

GPU Hardware Model: CPU vs. GPU

Multiple strategies for multiple purposes.
CPU strategy: large caches, large control logic. Purpose: complex, branching codes and complex memory access patterns. (A World Rally Championship car.)
GPU strategy: lots of computing power, simplified control. Purpose: regular data-parallel codes and simple memory access patterns. (A Formula One car.)

[Figure: a CPU die dominated by control logic and cache next to its DRAM, vs. a GPU die packed with ALUs.]

GPU Software Model (SIMT)

Kernels are enclosed in an implicit loop over an iteration space: one kernel instance runs for each point of the space, and threads execute this work simultaneously. Specific languages: NVIDIA CUDA, OpenCL.

__global__ void
vecadd(float *A, float *B, float *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int
main() {
    ...
    // vecadd<<<1,NB>>>(A, B, C);
    for (threadIdx.x = 0;
         threadIdx.x < NB;
         threadIdx.x++) {
        vecadd(A, B, C);
    }
    ...
}

(The sequential for loop illustrates the semantics of the commented-out vecadd<<<1,NB>>> kernel launch.)

GPU Software Model (SIMT): Hardware Abstraction

Each scalar core executes instances of the kernel; the thread executing a given instance is identified by the threadIdx variable. For the vecadd kernel above, the instances run simultaneously:

// i = threadIdx.x
instance 0:  int i = 0;  C[i] = A[i]+B[i];
instance 1:  int i = 1;  C[i] = A[i]+B[i];
instance 2:  int i = 2;  C[i] = A[i]+B[i];
instance 3:  int i = 3;  C[i] = A[i]+B[i];

Manycores

Intel SCC: 48 cores (P54C Pentium), no cache coherence, communication library.
Intel Xeon Phi / MIC: 61 cores (P54C Pentium), 4 hardware threads per core, a dedicated 512-bit SIMD instruction set, cache coherence.
A classical programming tool-chain (compilers, libraries)... but no free lunch: kernels and applications still need optimization work.
Discrete accelerator cards (for now!): data must be transferred to the card memory, and results transferred back to main memory.

3. Parallel Programming Models

Parallel Programming Models

Languages: directive-based languages, specialized languages, PGAS languages, ...
Libraries: linear algebra, FFT, ...

Directive-Based Languages - Cilk

Programming environment: a language and compiler (a keyword-based extension of C), an execution model, and a run-time system. Recursive parallelism, divide-and-conquer model.

Initially developed at the MIT Supertech Research Group (Charles E. Leiserson's team, mid-90s); now developed by Intel. Available in ICC and GNU GCC; experimental version in LLVM/Clang.

cilk int fibo(int n) {
    int r;
    if (n < 2)
        r = n;
    else {
        int x, y;
        x = spawn fibo(n - 1);
        y = spawn fibo(n - 2);
        sync;
        r = x + y;
    }
    return r;
}

Directive-Based Languages - OpenMP

Iterative parallelism: a parallel section executed by a team of threads.

int i;

#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

Task parallelism and recursive parallelism (OpenMP 3.0); task dependencies and accelerators (OpenMP 4.0).

list *ptr = list_head;

#pragma omp parallel
{
    #pragma omp single
    while (ptr != NULL) {
        void *data = ptr->data;

        #pragma omp task firstprivate(data)
        {
            process(data);
        }

        ptr = ptr->next;
    }

    #pragma omp taskwait
}

PGAS Languages - UPC

Partitioned Global Address Space; Unified Parallel C. Global shared data, data distribution, parallel loops, threads; task extensions (UPC Task Library). Example: a matrix-vector product distributed across threads.

#include <upc_relaxed.h>

shared [THREADS] int a[THREADS][THREADS];
shared int b[THREADS];
shared int c[THREADS];
int i, j;

upc_forall (i = 0; i < THREADS; i++; i) {
    c[i] = 0;
    for (j = 0; j < THREADS; j++) {
        c[i] += a[i][j] * b[j];
    }
}

Libraries

Specialized libraries: black-box parallelism.
Linear algebra: BLAS, LAPACK, Intel MKL, MAGMA, PLASMA.
Signal processing: FFTW, Spiral...

Common Denominator

Many similar fundamental services: a lower-level layer and an abstraction/optimization layer, i.e. a computing runtime system. Its job: mapping work onto computing resources, resolving trade-offs, optimizing, scheduling.

4. Computing Runtime Systems

Computing Runtime Systems

Two classes: thread scheduling and task scheduling.

Thread Scheduling

Thread: an unbounded parallel activity, with one state/context per thread. Variants: cooperative multithreading, preemptive multithreading. Nowadays the standard example is libpthread.

Discussion: flexibility; but what about resource consumption, adaptiveness, synchronization?

Task Scheduling

Task: an elementary computation, a piece of potential parallel work. No dedicated state: tasks are executed by an internal set of worker threads.

Variants: recursive tasks vs. non-blocking tasks; dependency management.

Examples: StarPU; Cilk's runtime; Intel Threading Building Blocks (TBB); StarSS / OmpSs; PaRSEC; ...

Discussion: abstraction, adaptiveness, transparent synchronization using dependencies.

Heterogeneous Task Scheduling

Scheduling on a platform equipped with accelerators
- Adapting to heterogeneity
- Decide which tasks to offload
- Decide which tasks to keep on the CPU

Communicate with discrete accelerator board(s)
- Send computation requests
- Send data to be processed
- Fetch results back
- Expensive: decide about worthiness

See StarPU talk
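The "worthiness" decision can be sketched with a simple cost model, assuming the runtime has timing estimates for both units (real schedulers such as StarPU's build such estimates from calibrated performance models): offload only when accelerator compute time plus the data-transfer cost beats plain CPU execution.

```python
def choose_unit(cpu_time, gpu_time, data_bytes, bandwidth):
    """Decide whether offloading a task is worthwhile.
    Times in seconds, bandwidth in bytes/second: the transfer cost over
    the bus is added to the accelerator's compute time."""
    transfer_time = data_bytes / bandwidth
    return "gpu" if gpu_time + transfer_time < cpu_time else "cpu"
```

With this model, a heavy kernel with modest data is offloaded, while a tiny kernel whose transfer cost dominates stays on the CPU.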

Computing Runtimes Ecosystem: Scheduling and Memory Management

Data transfers: CPU <-> discrete accelerator
- Minimize transfers
- Overlap transfers and requests with computation
- Cooperation with a Distributed Shared Memory system
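One way to minimize transfers, sketched below, is residency tracking: each piece of data records which memory nodes currently hold a valid copy, so a transfer is issued only on a miss, and a write invalidates stale copies (an MSI-like protocol, in the spirit of what runtime data-management layers do; this is not any library's actual API).

```python
class DataHandle:
    """Sketch of transfer minimization via data residency tracking:
    repeated reads on the same node reuse the cached copy, and writes
    invalidate copies held on other memory nodes."""

    def __init__(self, home="cpu"):
        self.valid_on = {home}   # memory nodes holding a valid copy
        self.transfers = 0       # count of actual data movements

    def acquire(self, node, mode="r"):
        if node not in self.valid_on:
            self.transfers += 1          # fetch a valid copy to this node
            self.valid_on.add(node)
        if mode == "w":                  # a write invalidates other copies
            self.valid_on = {node}
```

Combined with asynchronous copies, the same bookkeeping lets the runtime prefetch data for a scheduled task and overlap the transfer with other computation.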

Computing Runtimes Ecosystem: Scheduling and Networking

Distributed computing
- Interoperability, minimization, overlap

Cooperation with a network library (MPI, Global Arrays, etc.)
- Anticipate communication needs
- Merge multiple requests
- Throttle/alter scheduling according to network events
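Merging multiple requests can be sketched as message coalescing: consecutive small messages bound for the same peer are combined into one network request, amortizing per-message latency. The function below is an illustrative sketch, with a size threshold as an assumed tuning parameter.

```python
def merge_requests(requests, max_size):
    """Coalesce consecutive small messages to the same destination into a
    single network request. `requests` is a list of (destination, size)
    pairs; merging stops when the combined size would exceed `max_size`."""
    merged = []
    for dest, size in requests:
        if merged and merged[-1][0] == dest and merged[-1][1] + size <= max_size:
            merged[-1] = (dest, merged[-1][1] + size)
        else:
            merged.append((dest, size))
    return merged
```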

Computing Runtimes Ecosystem: Scheduling and I/O

Out-of-core
- Very large computations
- Temporarily storing large data structures on disk
- Interoperability, minimization, overlap

Cooperation with an I/O library
- When to store some data on disk? When to fetch it back?
- Heuristics
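A common heuristic for the "when to store data on disk" question is least-recently-used eviction, sketched below: keep at most a fixed number of data blocks in RAM and push the least recently used one to disk when a new block is needed (eviction is simulated here by a list; a real runtime would issue asynchronous writes through its I/O library).

```python
from collections import OrderedDict

class OutOfCoreManager:
    """Sketch of an LRU out-of-core heuristic: at most `capacity` blocks
    stay in RAM; the least recently used block is evicted to disk."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_ram = OrderedDict()   # block -> True, oldest first
        self.evicted = []             # stands in for "written to disk"

    def touch(self, block):
        """Record an access to `block`, fetching/evicting as needed."""
        if block in self.in_ram:
            self.in_ram.move_to_end(block)        # mark as recently used
            return
        if len(self.in_ram) >= self.capacity:
            victim, _ = self.in_ram.popitem(last=False)
            self.evicted.append(victim)
        self.in_ram[block] = True
```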

Computing Runtimes Ecosystem: Scheduling and Scheduling Theory

Algorithmics
- Designing scheduling algorithms
- Testing scheduling algorithms in real life

Computing runtimes as an interface framework
- Plug in new algorithms
- Keep the same interface
- Transparent for the application
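The "same interface, pluggable algorithms" idea can be sketched as follows: the runtime fixes a small scheduler interface (push a ready task, pop the next task to run), and policies are swapped behind it without touching application code. The two policies below are illustrative placeholders, not any runtime's actual scheduler classes.

```python
class Scheduler:
    """The runtime's fixed interface: push ready tasks, pop the next one.
    Policies plug in behind it, transparently for the application."""
    def push(self, task):
        raise NotImplementedError
    def pop(self):
        raise NotImplementedError

class FifoScheduler(Scheduler):
    """Run tasks in submission order."""
    def __init__(self):
        self.queue = []
    def push(self, task):
        self.queue.append(task)
    def pop(self):
        return self.queue.pop(0)

class PriorityScheduler(Scheduler):
    """Run the highest-priority ready task first."""
    def __init__(self):
        self.queue = []
    def push(self, task):
        self.queue.append(task)
    def pop(self):
        self.queue.sort(key=lambda t: t[0])   # lowest value = highest prio
        return self.queue.pop(0)

def drain(sched, tasks):
    """Application-side code is identical whichever policy is plugged in."""
    for t in tasks:
        sched.push(t)
    return [sched.pop() for _ in tasks]
```

This is what makes runtimes a convenient testbed for scheduling theory: a new algorithm is evaluated on real applications by swapping one component.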

Conclusion

Runtimes as interface frameworks
- Portability
- Control
- Adaptiveness
- Optimization
- Portability of performance

Program of the Training Session

Thursday, June 04:
- 09:00 (09:30) - 10:00: Introduction to Runtime Systems (Olivier Aumage)
  ... coffee break ...
- 10:15 - 12:00: The StarPU computing runtime, Part I (Olivier Aumage, Nathalie Furmento, Samuel Thibault)
  ... lunch break ...
- 14:00 - 16:00: The EZTrace framework for performance debugging, Part I (Matias Hastaran, François Rué)

Friday, June 05:
- 09:00 - 11:00: The hardware locality library (hwloc) (Brice Goglin)
  ... coffee break ...
- 11:15 - 12:45: TreeMatch, a process placement framework for multicore clusters (Emmanuel Jeannot)
  ... lunch break ...
- 14:00 - 16:00: The StarPU computing runtime, Part II