Introduction to Runtime Systems: Towards Portability of Performance
STORM (Static Optimizations, Runtime Methods) team
Olivier Aumage, Inria / LaBRI, in cooperation with La Maison de la Simulation
Contents
1. Introduction
2. Computing Hardware
3. Parallel Programming Models
4. Computing Runtime Systems
1. Introduction
Hardware Evolution
More capabilities, more complexity
- Graphics: higher resolutions, 2D acceleration, 3D rendering
- Networking: processing offload, zero-copy transfers, hardware multiplexing
- I/O: RAID, SSDs vs. disks, network-attached disks, parallel file systems
- Computing: multiprocessors, multicores, vector processing extensions, accelerators
Dilemma for the Application Programmer
Stay conservative?
- Only use standards and long-established features: sequential programming, common Unix system calls, TCP sockets
- Under-used hardware? Low performance?
Use tempting, bleeding-edge features?
- Efficiency, convenience
- Portability? Adaptiveness? Cost? Long-term viability? Vendor-tied code?
Use runtime systems!
The Role(s) of Runtime Systems
- Portability: abstraction; drivers, plugins
- Control: resource mapping, scheduling
- Adaptiveness: load balancing; monitoring, sampling, calibrating
- Optimization: request aggregation, resource locality, computation offload, computation/transfer overlap
Examples of Runtime Systems
- Networking: MPI (Message Passing Interface), Global Arrays, CCI (Common Communication Interface), Distributed Shared Memory systems
- Graphics: DirectX/Direct3D (Microsoft Windows), OpenGL
- I/O: MPI-IO, database engines (Google LevelDB)
- Computing runtime systems?...
2. Computing Hardware
Evolution of Computing Hardware
Rupture: the frequency wall
- Processing units cannot run any faster
- Looking for other sources of performance
Hardware parallelism
- Multiply existing processing power: have several processing units work together
- Not a new idea... but now becoming the key performance factor
Processor Parallelism
Various forms of hardware parallelism
- Multiprocessors
- Multicores
- Hardware multithreading (SMT)
- Vector processing (SIMD)
Multiple forms may be combined
Multiprocessors and Multicores
Multiprocessors: full processor replicas
- Rationale: share node contents, memory, and devices
- Memory sharing may involve non-uniformity
Multicores: processor circuit replicas (cores) printed on the same die
- Rationale: use the die area freed by the shrinking manufacturing process for more processing power
- Share memory and devices; may share some additional on-die circuitry (cache(s), uncore services)
See the upcoming hwloc and TreeMatch talks!
Multiprocessors and Multicores
Taking advantage of them?
- Needs multiple parallel application activities
Additional considerations
- Availability
- Work mapping issues
- Locality issues
- Memory bandwidth issues
Hardware Multithreading
Simultaneous Multithreading (SMT)
- Multiple processing contexts managed by the same core
- Enables interleaving multiple threads on the same core
Rationale
- Try to fill more computing units (e.g. int + float)
- Hide memory/cache latency
Taking advantage of it?
- Needs multiple parallel application activities
- Highly dependent on the characteristics of those activities: complementary vs. competitive
Additional considerations
- Availability
- Work mapping issues
- Locality issues
- Memory bandwidth issues
- Benefit vs. loss
Vector Processing
Single Instruction, Multiple Data (SIMD)
- Apply an instruction to multiple data elements simultaneously
- Enables repeating simple operations on array elements
- Rationale: share instruction decoding between several data elements
Taking advantage of it?
- Specially written kernels: compiler auto-vectorization, assembly language, intrinsics
Additional considerations
- Availability; feature sets/variants: MMX, 3DNow!, SSE [2...5], AVX...
- Benefit vs. loss
Accelerators
- Special-purpose computing devices (or general-purpose GPUs)
- (Initially) a discrete expansion card
- Rationale: die area trade-off
Single Instruction, Multiple Threads (SIMT)
- A single control unit... for several computing units
- SIMT is distinct from SIMD: control flows are allowed to diverge (e.g. on if/else branches)... but better avoid it!
[Figure: a GPU streaming multiprocessor: one control unit driving several scalar cores (streaming processors), with on-board DRAM]
GPU Hardware Model
CPU vs. GPU: multiple strategies for multiple purposes
CPU
- Strategy: large caches, large control logic
- Purpose: complex codes, branching, complex memory access patterns
- Analogy: a World Rally Championship car
GPU
- Strategy: lots of computing power, simplified control
- Purpose: regular data-parallel codes, simple memory access patterns
- Analogy: a Formula One car
[Figure: CPU die dominated by control logic and cache vs. GPU die dominated by ALUs]
GPU Software Model (SIMT)
- Kernels enclosed in an implicit loop over an iteration space: one kernel instance for each point of the space
- Threads execute work simultaneously
- Specific languages: NVIDIA CUDA, OpenCL

__global__ void
vecadd(float *A, float *B, float *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int
main() {
    ...
    // vecadd<<<1, NB>>>(A, B, C);
    for (threadIdx.x = 0;
         threadIdx.x < NB;
         threadIdx.x++) {
        vecadd(A, B, C);
    }
    ...
}
GPU Software Model (SIMT): Hardware Abstraction
- A scalar core executes instances of a kernel
- The thread executing a given instance is identified by the threadIdx variable

// The kernel instances, one per value of i = threadIdx.x:
{ int i = 0;         { int i = 1;         { int i = 2;         { int i = 3;
  C[i] = A[i]+B[i];    C[i] = A[i]+B[i];    C[i] = A[i]+B[i];    C[i] = A[i]+B[i];
}                    }                    }                    }
Manycores
Intel SCC
- 48 cores (P54C Pentium)
- No cache coherence
- Communication library
Intel Xeon Phi / MIC
- 61 cores (P54C Pentium-derived)
- 4 hardware threads per core
- 512-bit SIMD vector instruction set
- Cache coherence
Classical programming tool-chain (compilers, libraries)...
...but no free lunch: kernels and applications need optimization work
Discrete accelerator cards (for now!)
- Transfer data to the card memory
- Transfer results back to main memory
3. Parallel Programming Models
Parallel Programming Models
Languages
- Directive-based languages
- Specialized languages
- PGAS languages
- ...
Libraries
- Linear algebra
- FFT
- ...
Directive-Based Languages - Cilk
Programming environment
- A language and compiler: keyword-based extension of C
- An execution model and a runtime system
Recursive parallelism
- Divide-and-conquer model
History
- Initially developed at the MIT Supertech Research Group (Charles E. Leiserson's team), mid-90s
- Now developed by Intel; available in ICC and GNU GCC; experimental version in LLVM/Clang

Serial version:

int fibo(int n) {
    int r;
    if (n < 2)
        r = n;
    else {
        int x, y;
        x = fibo(n - 1);
        y = fibo(n - 2);
        r = x + y;
    }
    return r;
}

Cilk version:

cilk int fibo(int n) {
    int r;
    if (n < 2)
        r = n;
    else {
        int x, y;
        spawn x = fibo(n - 1);
        spawn y = fibo(n - 2);
        sync;
        r = x + y;
    }
    return r;
}
Directive-Based Languages - OpenMP
Iterative parallelism
- Parallel section
- Team of threads

int i;

#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}
Directive-Based Languages - OpenMP
Task parallelism, recursive parallelism (OpenMP 3.0)
Task dependencies, accelerators (OpenMP 4.0)

list *ptr = list_head;

#pragma omp parallel
{
    #pragma omp single
    while (ptr != NULL) {
        void *data = ptr->data;

        #pragma omp task firstprivate(data)
        {
            process(data);
        }

        ptr = ptr->next;
    }

    #pragma omp taskwait
}
PGAS Languages - UPC
Partitioned Global Address Space: Unified Parallel C
- Global shared data
- Data distribution
- Parallel loops
- Threads
- Task extensions (UPC Task Library)

#include <upc_relaxed.h>

shared [THREADS] int a[THREADS][THREADS];
shared int b[THREADS];
shared int c[THREADS];
int i, j;

upc_forall (i = 0; i < THREADS; i++; i) {
    c[i] = 0;
    for (j = 0; j < THREADS; j++) {
        c[i] += a[i][j] * b[j];
    }
}
Libraries
Specialized libraries: black-box parallelism
- Linear algebra: BLAS, LAPACK; Intel MKL, MAGMA, PLASMA
- Signal processing: FFTW, Spiral
- ...
Common Denominator
Many similar fundamental services
- Lower-level layer
- Abstraction/optimization layer
A computing runtime system
- Mapping work on computing resources
- Resolving trade-offs
- Optimizing
- Scheduling
4. Computing Runtime Systems
Computing Runtime Systems
Two classes
- Thread scheduling
- Task scheduling
Thread Scheduling
Thread
- Unbounded parallel activity
- One state/context per thread
Variants
- Cooperative multithreading
- Preemptive multithreading
Examples
- Nowadays: libpthread
Discussion
- Flexibility
- Resource consumption? Adaptiveness? Synchronization?
Task Scheduling
Task
- Elementary computation; potential parallel work
- No dedicated state
- Internal set of worker threads
Variants
- Recursive tasks vs. non-blocking tasks
- Dependency management
Examples
- StarPU
- Cilk's runtime, Intel Threading Building Blocks (TBB)
- StarSS / OmpSs
- PaRSEC
- ...
Discussion
- Abstraction
- Adaptiveness
- Transparent synchronization using dependencies
Heterogeneous Task Scheduling
Scheduling on platforms equipped with accelerators
Adapting to heterogeneity
- Decide which tasks to offload
- Decide which tasks to keep on the CPU
Communicate with discrete accelerator board(s)
- Send computation requests
- Send data to be processed
- Fetch results back
- Expensive!
Decide about worthiness
See the StarPU talk
Computing Runtimes Ecosystem
Scheduling and memory management
- Data transfers: CPU <-> discrete accelerator
- Minimize transfers
- Overlap transfers and requests with computation
- Cooperation with a Distributed Shared Memory system
Computing Runtimes Ecosystem
Scheduling and networking
- Distributed computing
- Interoperability, minimization, overlap
Cooperation with a network library (MPI, Global Arrays, etc.)
- Anticipate communication needs
- Merge multiple requests
- Throttle/alter scheduling on network events
Computing Runtimes Ecosystem
Scheduling and I/O
Out-of-core
- Very large computations
- Temporarily storing large data structures on disk
- Interoperability, minimization, overlap
Cooperation with an I/O library
- When to store some data on disk? When to fetch it back?
- Heuristics
Computing Runtimes Ecosystem
Scheduling, and scheduling theory
Algorithmics
- Designing scheduling algorithms
- Testing scheduling algorithms in real life
Computing runtimes as an interface framework
- Plug in new algorithms
- Keep the same interface
- Transparent for the application
Conclusion
Runtimes as interface frameworks
- Portability
- Control
- Adaptiveness
- Optimization
Portability of performance!
Program of the Training Session
Thursday, June 04
- 09:00 (09:30) - 10:00: Introduction to Runtime Systems (Olivier Aumage)
- ... coffee break ...
- 10:15 - 12:00: The StarPU computing runtime, Part I (Olivier Aumage, Nathalie Furmento, Samuel Thibault)
- ... lunch break ...
- 14:00 - 16:00: The EZTrace framework for performance debugging, Part I (Matias Hastaran, François Rué)
Friday, June 05
- 09:00 - 11:00: The hardware locality library (hwloc) (Brice Goglin)
- ... coffee break ...
- 11:15 - 12:45: TreeMatch, a process placement framework for multicore clusters (Emmanuel Jeannot)
- ... lunch break ...
- 14:00 - 16:00: The StarPU computing runtime, Part II