
Introduction to Runtime Systems: Towards Portability of Performance
Team STORM (Static Optimizations, Runtime Methods)
Olivier Aumage, Inria / LaBRI, in cooperation with La Maison de la Simulation

Contents
1. Introduction
2. Computing Hardware
3. Parallel Programming Models
4. Computing Runtime Systems

1. Introduction

Hardware Evolution: More Capabilities, More Complexity

Graphics: higher resolutions, 2D acceleration, 3D rendering.
Networking: processing offload, zero-copy transfers, hardware multiplexing.
I/O: RAID, SSDs vs. disks, network-attached disks, parallel file systems.
Computing: multiprocessors, multicores, vector processing extensions, accelerators.

Dilemma for the Application Programmer

Stay conservative? Only use standards and long-established features: sequential programming, common Unix system calls, TCP sockets. The risk: under-used hardware and low performance.

Or use tempting, bleeding-edge features? The gains: efficiency and convenience. The open questions: portability? adaptiveness? cost? long-term viability? vendor-tied code?

The way out: use runtime systems!

The Role(s) of Runtime Systems

Portability: abstraction; drivers, plugins.
Control: resource mapping, scheduling.
Adaptiveness: load balancing; monitoring, sampling, calibrating.
Optimization: request aggregation, resource locality, computation offload, computation/transfer overlap.

Examples of Runtime Systems

Networking: MPI (Message Passing Interface), Global Arrays, CCI (Common Communication Interface), distributed shared memory systems.
Graphics: DirectX, Direct3D (Microsoft Windows), OpenGL.
I/O: MPI-IO, database engines (Google LevelDB).
Computing runtime systems?...

2. Computing Hardware

Evolution of Computing Hardware

A rupture: the frequency wall. Processing units cannot run any faster, so other sources of performance must be found. Hardware parallelism multiplies the existing processing power by having several processing units work together. Not a new idea... but now becoming the key performance factor.

Processor Parallelism

Various forms of hardware parallelism: multiprocessors, multicores, hardware multithreading (SMT), vector processing (SIMD). Multiple forms may be combined.

Multiprocessors and Multicores

Multiprocessors: full processor replicates. Rationale: share the node contents, i.e. memory and devices. Memory sharing may involve non-uniformity. See the upcoming hwloc and TreeMatch talks!

Multicores: processor circuit replicates (cores) printed on the same die. Rationale: use the die area freed by the shrinking process for more processing power; share memory and devices; cores may also share some additional die circuitry (caches, uncore services). See the upcoming hwloc and TreeMatch talks!

Multiprocessors and Multicores: Taking Advantage of Them?

Requires multiple parallel application activities.
Additional considerations: availability, work mapping, locality, memory bandwidth.

Hardware Multithreading

Simultaneous Multithreading (SMT): multiple processing contexts managed by the same core, enabling the interleaving of multiple threads on that core. Rationale: try to fill more computing units (e.g. int + float units) and hide memory/cache latency.

Taking advantage of it? Requires multiple parallel application activities, and is highly dependent on the characteristics of those activities (complementary vs. competitive).

Additional considerations: availability, work mapping, locality, memory bandwidth, benefit vs. loss.

Vector Processing

Single Instruction, Multiple Data (SIMD): apply one instruction to multiple data elements simultaneously, enabling simple operations to be repeated over array elements. Rationale: share instruction decoding among several data elements.

Taking advantage of it? Specially written kernels: generated by the compiler, written in assembly language, or written with intrinsics.

Additional considerations: availability; feature sets and variants (MMX, 3DNow!, SSE [2...5], AVX, ...); benefit vs. loss.

Accelerators

Special-purpose computing devices (or general-purpose GPUs); initially a discrete expansion card. Rationale: a die area trade-off.

Single Instruction, Multiple Threads (SIMT): a single control unit... for several computing units (the scalar cores, or "streaming processors", of a GPU streaming multiprocessor).

SIMT is distinct from SIMD: control flow is allowed to diverge (e.g. between the branches of an if/else), but divergence is better avoided.

[Figure: a GPU as a set of streaming multiprocessors, each with one control unit driving many scalar cores, attached to device DRAM.]

GPU Hardware Model: CPU vs. GPU

Multiple strategies for multiple purposes.
CPU strategy: large caches, large control logic. Purpose: complex, branching codes and complex memory access patterns. (A World Rally Championship car.)
GPU strategy: lots of computing power, simplified control. Purpose: regular data-parallel codes and simple memory access patterns. (A Formula One car.)

[Figure: a CPU die dominated by control logic and cache next to its DRAM, vs. a GPU die packed with ALUs.]

GPU Software Model (SIMT)

Kernels are enclosed in an implicit loop over an iteration space: one kernel instance runs for each point of the space, and threads execute this work simultaneously. Specific languages: NVIDIA CUDA, OpenCL.

__global__ void
vecadd(float *A, float *B, float *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int
main() {
    ...
    // vecadd<<<1,NB>>>(A, B, C);
    for (threadIdx.x = 0;
         threadIdx.x < NB;
         threadIdx.x++) {
        vecadd(A, B, C);
    }
    ...
}

(The sequential for loop illustrates the semantics of the commented-out vecadd<<<1,NB>>> kernel launch.)

GPU Software Model (SIMT): Hardware Abstraction

Each scalar core executes instances of the kernel; the thread executing a given instance is identified by the threadIdx variable. For the vecadd kernel above, the instances run simultaneously:

// i = threadIdx.x
instance 0:  int i = 0;  C[i] = A[i]+B[i];
instance 1:  int i = 1;  C[i] = A[i]+B[i];
instance 2:  int i = 2;  C[i] = A[i]+B[i];
instance 3:  int i = 3;  C[i] = A[i]+B[i];

Manycores

Intel SCC: 48 cores (P54C Pentium), no cache coherence, communication library.
Intel Xeon Phi / MIC: 61 cores (P54C Pentium), 4 hardware threads per core, a dedicated 512-bit SIMD instruction set, cache coherence.
A classical programming tool-chain (compilers, libraries)... but no free lunch: kernels and applications still need optimization work.
Discrete accelerator cards (for now!): data must be transferred to the card memory, and results transferred back to main memory.

3. Parallel Programming Models

Parallel Programming Models

Languages: directive-based languages, specialized languages, PGAS languages, ...
Libraries: linear algebra, FFT, ...

Directive-Based Languages - Cilk

Programming environment: a language and compiler (a keyword-based extension of C), an execution model, and a run-time system. Recursive parallelism, divide-and-conquer model.

Initially developed at the MIT Supertech Research Group (Charles E. Leiserson's team, mid-90s); now developed by Intel. Available in ICC and GNU GCC; experimental version in LLVM/Clang.

cilk int fibo(int n) {
    int r;
    if (n < 2)
        r = n;
    else {
        int x, y;
        x = spawn fibo(n - 1);
        y = spawn fibo(n - 2);
        sync;
        r = x + y;
    }
    return r;
}

Directive-Based Languages - OpenMP

Iterative parallelism: a parallel section executed by a team of threads.

int i;

#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

Task parallelism and recursive parallelism (OpenMP 3.0); task dependencies and accelerators (OpenMP 4.0).

list *ptr = list_head;

#pragma omp parallel
{
    #pragma omp single
    while (ptr != NULL) {
        void *data = ptr->data;

        #pragma omp task firstprivate(data)
        {
            process(data);
        }

        ptr = ptr->next;
    }

    #pragma omp taskwait
}

PGAS Languages - UPC

Partitioned Global Address Space; Unified Parallel C. Global shared data, data distribution, parallel loops, threads; task extensions (UPC Task Library). Example: a matrix-vector product distributed across threads.

#include <upc_relaxed.h>

shared [THREADS] int a[THREADS][THREADS];
shared int b[THREADS];
shared int c[THREADS];
int i, j;

upc_forall (i = 0; i < THREADS; i++; i) {
    c[i] = 0;
    for (j = 0; j < THREADS; j++) {
        c[i] += a[i][j] * b[j];
    }
}

Libraries

Specialized libraries: black-box parallelism.
Linear algebra: BLAS, LAPACK, Intel MKL, MAGMA, PLASMA.
Signal processing: FFTW, Spiral...

Common Denominator

Many similar fundamental services: a lower-level layer and an abstraction/optimization layer, i.e. a computing runtime system. Its job: mapping work onto computing resources, resolving trade-offs, optimizing, scheduling.

4. Computing Runtime Systems

Computing Runtime Systems

Two classes: thread scheduling and task scheduling.

Thread Scheduling

Thread: an unbounded parallel activity, with one state/context per thread. Variants: cooperative multithreading, preemptive multithreading. Nowadays the standard example is libpthread.

Discussion: flexibility; but what about resource consumption, adaptiveness, synchronization?

Task Scheduling

Task: an elementary computation, a piece of potential parallel work. No dedicated state: tasks are executed by an internal set of worker threads.

Variants: recursive tasks vs. non-blocking tasks; dependency management.

Examples: StarPU; Cilk's runtime; Intel Threading Building Blocks (TBB); StarSS / OmpSs; PaRSEC; ...

Discussion: abstraction, adaptiveness, transparent synchronization using dependencies.

Heterogeneous Task Scheduling

Scheduling on a platform equipped with accelerators
- Adapting to heterogeneity
- Decide which tasks to offload
- Decide which tasks to keep on the CPU

Communicate with discrete accelerator board(s)
- Send computation requests
- Send data to be processed
- Fetch results back
- Expensive: decide about worthiness

See StarPU talk
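The "worthiness" decision can be sketched with a simple cost model, assuming the runtime has timing estimates for both units (real schedulers such as StarPU's build such estimates from calibrated performance models): offload only when accelerator compute time plus the data-transfer cost beats plain CPU execution.

```python
def choose_unit(cpu_time, gpu_time, data_bytes, bandwidth):
    """Decide whether offloading a task is worthwhile.
    Times in seconds, bandwidth in bytes/second: the transfer cost over
    the bus is added to the accelerator's compute time."""
    transfer_time = data_bytes / bandwidth
    return "gpu" if gpu_time + transfer_time < cpu_time else "cpu"
```

With this model, a heavy kernel with modest data is offloaded, while a tiny kernel whose transfer cost dominates stays on the CPU.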

Computing Runtimes Ecosystem: Scheduling and Memory Management

Data transfers: CPU <-> discrete accelerator
- Minimize transfers
- Overlap transfers and requests with computation
- Cooperation with a Distributed Shared Memory system
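One way to minimize transfers, sketched below, is residency tracking: each piece of data records which memory nodes currently hold a valid copy, so a transfer is issued only on a miss, and a write invalidates stale copies (an MSI-like protocol, in the spirit of what runtime data-management layers do; this is not any library's actual API).

```python
class DataHandle:
    """Sketch of transfer minimization via data residency tracking:
    repeated reads on the same node reuse the cached copy, and writes
    invalidate copies held on other memory nodes."""

    def __init__(self, home="cpu"):
        self.valid_on = {home}   # memory nodes holding a valid copy
        self.transfers = 0       # count of actual data movements

    def acquire(self, node, mode="r"):
        if node not in self.valid_on:
            self.transfers += 1          # fetch a valid copy to this node
            self.valid_on.add(node)
        if mode == "w":                  # a write invalidates other copies
            self.valid_on = {node}
```

Combined with asynchronous copies, the same bookkeeping lets the runtime prefetch data for a scheduled task and overlap the transfer with other computation.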

Computing Runtimes Ecosystem: Scheduling and Networking

Distributed computing
- Interoperability, minimization, overlap

Cooperation with a network library (MPI, Global Arrays, etc.)
- Anticipate communication needs
- Merge multiple requests
- Throttle/alter scheduling according to network events
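Merging multiple requests can be sketched as message coalescing: consecutive small messages bound for the same peer are combined into one network request, amortizing per-message latency. The function below is an illustrative sketch, with a size threshold as an assumed tuning parameter.

```python
def merge_requests(requests, max_size):
    """Coalesce consecutive small messages to the same destination into a
    single network request. `requests` is a list of (destination, size)
    pairs; merging stops when the combined size would exceed `max_size`."""
    merged = []
    for dest, size in requests:
        if merged and merged[-1][0] == dest and merged[-1][1] + size <= max_size:
            merged[-1] = (dest, merged[-1][1] + size)
        else:
            merged.append((dest, size))
    return merged
```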

Computing Runtimes Ecosystem: Scheduling and I/O

Out-of-core
- Very large computations
- Temporarily storing large data structures on disk
- Interoperability, minimization, overlap

Cooperation with an I/O library
- When to store some data on disk? When to fetch it back?
- Heuristics
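A common heuristic for the "when to store data on disk" question is least-recently-used eviction, sketched below: keep at most a fixed number of data blocks in RAM and push the least recently used one to disk when a new block is needed (eviction is simulated here by a list; a real runtime would issue asynchronous writes through its I/O library).

```python
from collections import OrderedDict

class OutOfCoreManager:
    """Sketch of an LRU out-of-core heuristic: at most `capacity` blocks
    stay in RAM; the least recently used block is evicted to disk."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_ram = OrderedDict()   # block -> True, oldest first
        self.evicted = []             # stands in for "written to disk"

    def touch(self, block):
        """Record an access to `block`, fetching/evicting as needed."""
        if block in self.in_ram:
            self.in_ram.move_to_end(block)        # mark as recently used
            return
        if len(self.in_ram) >= self.capacity:
            victim, _ = self.in_ram.popitem(last=False)
            self.evicted.append(victim)
        self.in_ram[block] = True
```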

Computing Runtimes Ecosystem: Scheduling and Scheduling Theory

Algorithmics
- Designing scheduling algorithms
- Testing scheduling algorithms in real life

Computing runtimes as an interface framework
- Plug in new algorithms
- Keep the same interface
- Transparent for the application
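The "same interface, pluggable algorithms" idea can be sketched as follows: the runtime fixes a small scheduler interface (push a ready task, pop the next task to run), and policies are swapped behind it without touching application code. The two policies below are illustrative placeholders, not any runtime's actual scheduler classes.

```python
class Scheduler:
    """The runtime's fixed interface: push ready tasks, pop the next one.
    Policies plug in behind it, transparently for the application."""
    def push(self, task):
        raise NotImplementedError
    def pop(self):
        raise NotImplementedError

class FifoScheduler(Scheduler):
    """Run tasks in submission order."""
    def __init__(self):
        self.queue = []
    def push(self, task):
        self.queue.append(task)
    def pop(self):
        return self.queue.pop(0)

class PriorityScheduler(Scheduler):
    """Run the highest-priority ready task first."""
    def __init__(self):
        self.queue = []
    def push(self, task):
        self.queue.append(task)
    def pop(self):
        self.queue.sort(key=lambda t: t[0])   # lowest value = highest prio
        return self.queue.pop(0)

def drain(sched, tasks):
    """Application-side code is identical whichever policy is plugged in."""
    for t in tasks:
        sched.push(t)
    return [sched.pop() for _ in tasks]
```

This is what makes runtimes a convenient testbed for scheduling theory: a new algorithm is evaluated on real applications by swapping one component.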

Conclusion

Runtimes as interface frameworks
- Portability
- Control
- Adaptiveness
- Optimization
- Portability of performance

Program of the Training Session

Thursday, June 04:
- 09:00 (09:30) - 10:00: Introduction to Runtime Systems (Olivier Aumage)
  ... coffee break ...
- 10:15 - 12:00: The StarPU computing runtime, Part I (Olivier Aumage, Nathalie Furmento, Samuel Thibault)
  ... lunch break ...
- 14:00 - 16:00: The EZTrace framework for performance debugging, Part I (Matias Hastaran, François Rué)

Friday, June 05:
- 09:00 - 11:00: The hardware locality library (hwloc) (Brice Goglin)
  ... coffee break ...
- 11:15 - 12:45: TreeMatch, a process placement framework for multicore clusters (Emmanuel Jeannot)
  ... lunch break ...
- 14:00 - 16:00: The StarPU computing runtime, Part II