The StarPU Runtime System
A Unified Runtime System for Heterogeneous Architectures
Olivier Aumage, STORM Team, Inria, LaBRI
1. Introduction
Hardware Evolution
More capabilities, more complexity:
- Graphics: higher resolutions, 2D acceleration, 3D rendering
- Networking: processing offload, zero-copy transfers, hardware multiplexing
- I/O: RAID, SSD vs. disks
- Computing: multiprocessors, multicores, vector processing extensions, accelerators
Dilemma for the Application Programmer
Stay conservative?
- Only use standards and long-established features: sequential programming, common Unix system calls, TCP sockets
- The risk: under-used hardware, low performance
Use tempting, bleeding-edge features?
- The appeal: efficiency, convenience
- The risks: portability? adaptiveness? cost? long-term viability? vendor-tied code?
The way out: use runtime systems!
The Role(s) of Runtime Systems
- Portability: abstraction; drivers, plugins
- Control: resource mapping, scheduling
- Adaptiveness: load balancing; monitoring, sampling, calibrating
- Optimization: request aggregation, resource locality, computation offload, computation/transfer overlap
Examples of Runtime Systems
- Networking: MPI (Message Passing Interface), Global Arrays, CCI (Common Communication Interface), distributed shared memory systems
- Graphics: DirectX/Direct3D (Microsoft Windows), OpenGL
- I/O: MPI-IO, database engines (Google LevelDB)
- Computing runtime systems?...
Parallel Programming Means
- Languages: directive-based language extensions, specialized languages, PGAS languages...
- Libraries: linear algebra, FFT...
Common Denominator
- Many similar fundamental services: a lower-level layer plus an abstraction/optimization layer form a computing runtime system
- Its job: mapping work onto computing resources, resolving trade-offs, optimizing, scheduling
Computing Runtime Systems
Two major classes: thread scheduling and task scheduling.
Thread Scheduling
- Thread: an unbounded parallel activity, with one state/context per thread
- Variants: cooperative multithreading, preemptive multithreading
- Examples: nowadays, libpthread
- Discussion: flexibility, but what about resource consumption? adaptiveness? synchronization?
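The explicit-threading style discussed above can be sketched as follows (an illustrative example, not from the slides; names and sizes are arbitrary). Note how the programmer manages work splitting and synchronization by hand:

```c
/* Illustrative sketch: explicit thread scheduling with libpthread.
 * Each thread carries its own context; the programmer splits the work
 * and synchronizes manually via pthread_join. */
#include <pthread.h>

#define NTHREADS 4
#define N 1000

static double data[N];
static double partial[NTHREADS];

/* Each worker sums its own slice of the array. */
static void *worker(void *arg) {
    long id = (long)arg;
    long begin = id * (N / NTHREADS);
    long end = (id + 1) * (N / NTHREADS);
    double s = 0.0;
    for (long i = begin; i < end; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

double parallel_sum(void) {
    pthread_t th[NTHREADS];
    for (long i = 0; i < N; i++)
        data[i] = 1.0;
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&th[id], NULL, worker, (void *)id);
    double total = 0.0;
    for (long id = 0; id < NTHREADS; id++) {
        pthread_join(th[id], NULL);   /* explicit synchronization point */
        total += partial[id];
    }
    return total;
}
```

This hand-managed splitting and joining is precisely what task-based runtimes abstract away.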
Task Scheduling
- Task: an elementary computation, a potential unit of parallel work
- No dedicated state; tasks are executed by an internal set of worker threads
- Variants: single tasks vs. multiple tasks, recursive tasks vs. non-blocking tasks, dependency management
- Examples: StarPU, GLINDA, Cilk's runtime, Intel Threading Building Blocks (TBB), StarSs/OmpSs, PaRSEC...
- Discussion: abstraction, adaptiveness, transparent synchronization using dependencies
Evolution of Computing Hardware
- Rupture: the Frequency Wall; processing units cannot run any faster, so other sources of performance must be found
- Hardware parallelism: multiply existing processing power by having several processing units work together
- Not a new idea, but now becoming the key performance factor
Processor Parallelisms
Various forms of hardware parallelism:
- Multiprocessors
- Multicores
- Hardware multithreading (SMT)
- Vector processing (SIMD)
Multiple forms may be combined.
Multiprocessors and Multicores
- Multiprocessors: full processor replicates. Rationale: share the node contents, i.e. memory and devices; memory sharing may involve non-uniformity (NUMA)
- Multicores: processor circuit replicates (cores) printed on the same die. Rationale: use available die area (freed by process shrinking) for more processing power; share memory and devices, and possibly some additional die circuitry (cache(s), uncore services)
Taking advantage of them?
- Needs multiple parallel application activities
- Additional considerations: availability, work mapping issues, locality issues, memory bandwidth issues
Hardware Multithreading
- Simultaneous Multithreading (SMT): multiple processing contexts managed by the same core, enabling several threads to be interleaved on that core
- Rationale: try to fill more computing units (e.g. int + float), hide memory/cache latency
Taking advantage of it?
- Needs multiple parallel application activities
- Highly dependent on the characteristics of the application's activities: complementary vs. competitive
- Additional considerations: availability, work mapping issues, locality issues, memory bandwidth issues, benefit vs. loss
Vector Processing
- Single Instruction, Multiple Data (SIMD): apply an instruction to multiple data elements simultaneously, enabling simple operations to be repeated over array elements
- Rationale: share instruction decoding between several data elements
Taking advantage of it?
- Specially written kernels: compiler auto-vectorization, assembly language, or intrinsics
- Additional considerations: availability, feature sets/variants (MMX, 3DNow!, SSE [2...5], AVX...), benefit vs. loss
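The two kernel-writing routes above can be sketched on a vector addition (an illustrative example, not from the slides; it assumes an x86 CPU with SSE2 and, for the intrinsic version, a length that is a multiple of 4):

```c
/* Illustrative sketch: the same vector addition written as a plain loop
 * that a compiler can auto-vectorize, and with explicit SSE intrinsics. */
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Scalar form: compilers typically vectorize this at -O2/-O3. */
void vecadd_scalar(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Explicit SSE form: 4 floats per instruction; n must be a multiple of 4. */
void vecadd_sse(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);  /* load 4 floats (unaligned) */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));  /* 4 adds at once */
    }
}
```

The intrinsic version illustrates the "benefit vs. loss" point: it is faster on matching hardware but tied to one SIMD feature set, while the scalar loop stays portable.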
Accelerators
- Special-purpose computing devices (or general-purpose GPUs); (initially) a discrete expansion card
- Rationale: die area trade-off
- Single Instruction, Multiple Threads (SIMT): a single control unit drives several computing units; on a GPU, scalar cores (streaming processors) are grouped under one control unit in a streaming multiprocessor, backed by on-board DRAM
- SIMT is distinct from SIMD: control flows are allowed to diverge (e.g. across an if/else)... but divergence is better avoided!
GPU Hardware Model: CPU vs GPU
Multiple strategies for multiple purposes:
- CPU strategy: large caches, large control logic. Purpose: complex, branching codes with complex memory access patterns. Analogy: a World Rally Championship car.
- GPU strategy: lots of computing power (many ALUs), simplified control logic. Purpose: regular data-parallel codes with simple memory access patterns. Analogy: a Formula One car.
GPU Software Model (SIMT)
- Kernels are enclosed in an implicit loop over an iteration space; one kernel instance is run for each point of the space
- Threads execute the work simultaneously
- Specific languages: NVIDIA CUDA, OpenCL
- Hardware abstraction: each scalar core executes kernel instances; the thread executing a given instance is identified by the threadIdx variable, so each instance effectively runs with its own i = threadIdx.x (i = 0, i = 1, i = 2, ...)

CUDA vector-addition kernel from the slide (cleaned up); the commented-out launch is what a real CUDA program uses, while the loop below it shows the conceptual, sequential equivalent:

    __global__ void
    vecadd(float *A, float *B, float *C) {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int
    main() {
        ...
        // vecadd<<<1, NB>>>(A, B, C);
        for (threadIdx.x = 0; threadIdx.x < NB; threadIdx.x++) {
            vecadd(A, B, C);
        }
        ...
    }
Manycores
- Intel SCC: 48 cores (P54C Pentium), no cache coherence, communication library
- Intel Xeon Phi/MIC: 61 cores (P54C Pentium), 4 hardware threads per core, AVX 512-bit SIMD instruction set, cache coherence
- Classical programming tool-chain (compilers, libraries)... but no free lunch: kernels and applications still need optimization work
- Discrete accelerator cards (for now!): transfer data to card memory, transfer results back to main memory
Heterogeneous Parallel Platforms
- Heterogeneous association: general-purpose processor + specialized accelerator
- Generalization: combinations of various units: latency-optimized cores, throughput-optimized cores, energy-optimized cores
- Distributed cores: standalone GPUs, Intel Xeon Phi (MIC), Intel Single-Chip Cloud (SCC)
- Integrated cores: Intel Haswell, AMD Fusion, NVIDIA Tegra
Programming Models for Heterogeneous Platforms?
How to program these architectures?
- Multicore programming: pthreads, OpenMP, TBB, Cilk?, MPI...
- Accelerator programming: consensus on OpenCL? (often) a pure offloading model; also CUDA, libspe, ATI Stream
- Hybrid models? Take advantage of all resources, but at the cost of complex interactions
Heterogeneous Task Scheduling
Scheduling on a platform equipped with accelerators:
- Adapting to heterogeneity: decide which tasks to offload and which tasks to keep on the CPU
- Communicating with discrete accelerator board(s): send computation requests, send the data to be processed, fetch results back; this is expensive
- Deciding whether offloading is worthwhile at all
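The "worthiness" decision above can be sketched with a naive cost model (an illustrative toy, not StarPU's scheduler; all parameter names are hypothetical): offloading pays off only when the accelerator's compute time plus the round-trip transfer cost beats the CPU's compute time.

```c
/* Illustrative sketch: a naive offload-worthiness test. */

/* Time to move `bytes` over a link with the given latency (seconds)
 * and bandwidth (bytes per second). */
double transfer_time(double bytes, double latency, double bandwidth) {
    return latency + bytes / bandwidth;
}

/* Returns 1 if running the task on the accelerator is predicted to be
 * faster than keeping it on the CPU, accounting for both the input
 * transfer and fetching the results back. */
int worth_offloading(double cpu_time, double gpu_time,
                     double bytes_in, double bytes_out,
                     double latency, double bandwidth) {
    double total_gpu = gpu_time
                     + transfer_time(bytes_in, latency, bandwidth)
                     + transfer_time(bytes_out, latency, bandwidth);
    return total_gpu < cpu_time;
}
```

For example, a task that takes 1 s on the CPU but 0.1 s on the GPU is worth offloading across a 1 GB/s link for 1 MB of data each way; the same task taking only 0.05 s on the CPU is not.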
2. StarPU Programming/Execution Models
Task Parallelism Principles
- A task = an «elementary» computation + dependencies
- Input dependencies (e.g. A, B), a computation kernel (e.g. A = A+B), output dependencies (e.g. A)
Task Relationships: Abstract Application Structure
- Tasks (elementary computations + dependencies) are linked through their input and output dependencies
- The resulting structure is a Directed Acyclic Graph (DAG) of tasks
StarPU Programming Model: Sequential Task Submission
- Express parallelism using the natural program flow
- Submit tasks in the sequential flow of the program, and let the runtime schedule the tasks asynchronously
Ex.: Sequential Cholesky Decomposition

    for (j = 0; j < N; j++) {
        POTRF(A[j][j]);
        for (i = j+1; i < N; i++)
            TRSM(A[i][j], A[j][j]);
        for (i = j+1; i < N; i++) {
            SYRK(A[i][i], A[i][j]);
            for (k = j+1; k < i; k++)
                GEMM(A[i][k], A[i][j], A[k][j]);
        }
    }
Ex.: Task-Based Cholesky Decomposition

    for (j = 0; j < N; j++) {
        POTRF(RW, A[j][j]);
        for (i = j+1; i < N; i++)
            TRSM(RW, A[i][j], R, A[j][j]);
        for (i = j+1; i < N; i++) {
            SYRK(RW, A[i][i], R, A[i][j]);
            for (k = j+1; k < i; k++)
                GEMM(RW, A[i][k], R, A[i][j], R, A[k][j]);
        }
    }
    wait();
Dynamic Task Graph Building
Considering the task-based Cholesky loop, with its RW/R-annotated data references and final wait():
- Tasks are submitted asynchronously at run-time
- Data references are annotated with their access modes
- StarPU infers data dependences and builds a graph of POTRF, TRSM, SYRK and GEMM tasks
- The graph of tasks is then executed
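The dependence-inference step can be sketched as a toy in C. This is an illustrative simplification, not StarPU's implementation: here every access simply depends on the datum's last writer, whereas a real runtime also tracks reader sets to order write-after-read correctly and to let concurrent readers proceed in parallel.

```c
/* Illustrative sketch: inferring task dependencies from annotated data
 * accesses, as a sequential-task-flow runtime can do at submission time. */

enum access_mode { R, RW };

#define MAX_DATA 64
#define MAX_DEPS 256

/* last_writer[d] = id of the last task that wrote datum d (-1 if none). */
static int last_writer[MAX_DATA];
static int dep_from[MAX_DEPS], dep_to[MAX_DEPS];
static int ndeps;

void task_graph_init(void) {
    for (int d = 0; d < MAX_DATA; d++)
        last_writer[d] = -1;
    ndeps = 0;
}

/* Record one annotated access of `task` to `datum`: add a dependency
 * edge from the datum's last writer, and on RW access make this task
 * the new last writer. */
void submit_access(int task, int datum, enum access_mode mode) {
    if (last_writer[datum] >= 0) {
        dep_from[ndeps] = last_writer[datum];
        dep_to[ndeps] = task;
        ndeps++;
    }
    if (mode == RW)
        last_writer[datum] = task;
}

int dependency_count(void) { return ndeps; }
```

Replaying the annotated Cholesky loop through such a routine is what turns the sequential submission order into a DAG.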
StarPU Execution Model: Task Scheduling
- Mapping the graph of tasks (DAG) onto the hardware: allocating computing resources, enforcing dependency constraints, handling data transfers
- Adaptiveness: a single DAG enables multiple schedulings, and a single DAG can be mapped onto multiple platforms (e.g. a timeline of the same DAG spread over CPUs and a GPU)
A Single DAG for Multiple Schedules, Platforms
The same DAG can be mapped onto a multicore CPU platform or onto mixed CPU/multi-GPU platforms.
What StarPU does for You: Heterogeneous Task Scheduling
- Heterogeneous task mapping using a selectable scheduling algorithm
- Submitted tasks are dispatched among the CPU cores and the GPUs (e.g. GPU 1, GPU 2)
120 What StarPU does for You: Data Transfers [diagram: main memory (RAM) and the GPU0/GPU1 memories exchanging tiles as the POTRF/TRSM/SYRK/GEMM graph executes] Handles dependencies. Handles asynchronous data transfers. Handles data replication. Handles data consistency (MSI protocol). Olivier Aumage STORM Team The StarPU Runtime 2. StarPU Programming/Execution Models 36
121 Showcase with the MAGMA Linear Algebra Library UTK, INRIA HIEPACS, INRIA RUNTIME. QR decomposition on 16 CPUs (AMD) + 4 GPUs (C1060). [plot: Gflop/s vs. matrix order, for 1 to 4 GPUs each paired with CPUs, up to 4 GPUs + 16 CPUs] Expected increase from adding 12 CPUs: ~150 Gflop/s; measured increase: ~200 Gflop/s. Olivier Aumage STORM Team The StarPU Runtime 2. StarPU Programming/Execution Models 37
122 Showcase with the MAGMA Linear Algebra Library QR kernel properties: SGEQRT CPU 9 GFlop/s, GPU 30 GFlop/s, speed-up 3; STSQRT CPU 12 GFlop/s, GPU 37 GFlop/s, speed-up 3; SOMQRT CPU 8.5 GFlop/s, GPU 227 GFlop/s, speed-up 27; SSSMQ CPU 10 GFlop/s, GPU 285 GFlop/s, speed-up 28. Consequences on task distribution: SGEQRT 20% of tasks on GPU, SSSMQ 92% of tasks on GPU. Taking advantage of heterogeneity: each unit only runs the tasks it is good at, and avoids those it is not. Olivier Aumage STORM Team The StarPU Runtime 2. StarPU Programming/Execution Models 38
123 StarPU in a Nutshell Rationale: let applications submit tasks in the natural, sequential flow of the program, and map computations on heterogeneous computing units. Shipped as an LGPL library: an application programming interface, and a target for compilers, libraries and high-level layers. Programming model: tasks and the data relationships between them. Execution model: heterogeneous task scheduling, automatic data transfers. Olivier Aumage STORM Team The StarPU Runtime 2. StarPU Programming/Execution Models 39
124 3 StarPU Programming Example Olivier Aumage STORM Team The StarPU Runtime 40
130 Basic Example: Scaling a Vector

    float factor = /* ... */;
    float vector[NX];
    starpu_data_handle_t vector_handle;

    /* ... fill vector ... */

    starpu_vector_data_register(&vector_handle, 0,
        (uintptr_t) vector, NX, sizeof(vector[0]));

    starpu_task_insert(
        &scal_cl,
        STARPU_RW, vector_handle,
        STARPU_VALUE, &factor, sizeof(factor),
        0);

    starpu_task_wait_for_all();
    starpu_data_unregister(vector_handle);

    /* ... display vector ... */

Olivier Aumage STORM Team The StarPU Runtime 3. StarPU Programming Example 46
134 Declaring a Codelet to StarPU Define a struct starpu_codelet. Plug the kernel function (here: scal_cpu_func). Declare the number of data pieces used by the kernel (here: a single vector). Declare how the kernel accesses each piece of data (here: the vector is scaled in place, thus R/W).

    struct starpu_codelet scal_cl = {
        .cpu_funcs = { scal_cpu_func, NULL },
        .nbuffers  = 1,
        .modes     = { STARPU_RW },
    };

Olivier Aumage STORM Team The StarPU Runtime 3. StarPU Programming Example 47
139 Declaring and Managing Data Put data under StarPU control. Initialize a piece of data. Register the piece of data and get a handle: the vector is now under StarPU control. Use the data through the handle. Unregister the piece of data: the handle is destroyed and the vector is back under user control.

    float vector[NX];
    /* ... fill data ... */

    starpu_data_handle_t vector_handle;
    starpu_vector_data_register(&vector_handle, 0,
        (uintptr_t) vector, NX, sizeof(vector[0]));

    /* ... use the vector through the handle ... */

    starpu_data_unregister(vector_handle);

Olivier Aumage STORM Team The StarPU Runtime 3. StarPU Programming Example 48
144 Writing a Kernel Function Every kernel function has the same C prototype. Retrieve the vector's data interface. Get the vector's number of elements and base pointer. Get the scaling factor as inline argument. Compute the vector scaling.

    void scal_cpu_func(void *buffers[], void *cl_arg) {
        struct starpu_vector_interface *vector_handle = buffers[0];

        unsigned n = STARPU_VECTOR_GET_NX(vector_handle);
        float *vector = (float *) STARPU_VECTOR_GET_PTR(vector_handle);

        float *ptr_factor = cl_arg;

        unsigned i;
        for (i = 0; i < n; i++)
            vector[i] *= *ptr_factor;
    }

Olivier Aumage STORM Team The StarPU Runtime 3. StarPU Programming Example 49
152 Submitting a task The starpu_task_insert call inserts a task in the StarPU DAG. Arguments: the codelet structure, the StarPU-managed data, the small-size inline data, and 0 to mark the end of arguments.

    starpu_task_insert(&scal_cl,
        STARPU_RW, vector_handle,
        STARPU_VALUE, &factor, sizeof(factor),
        0);

Notes: the task is submitted non-blockingly; dependencies on previously submitted tasks are enforced through its data, following the natural order of the program. Olivier Aumage STORM Team The StarPU Runtime 3. StarPU Programming Example 50
156 Waiting for Submitted Task Completion Tasks are submitted non-blockingly. Wait for all submitted tasks to complete their work:

    /* non-blocking task submits */
    starpu_task_insert(...);
    starpu_task_insert(...);
    starpu_task_insert(...);
    ...

    /* wait for all tasks submitted so far */
    starpu_task_wait_for_all();

Olivier Aumage STORM Team The StarPU Runtime 3. StarPU Programming Example 51
159 Heterogeneity: Declaring Device-Specific Kernels Extending a codelet to handle heterogeneous platforms. Multiple kernel implementations for a CPU: SSE, AVX, ... optimized kernels. Kernel implementations for accelerator devices: OpenCL, NVIDIA CUDA kernels.

    struct starpu_codelet scal_cl = {
        .cpu_funcs    = { scal_cpu_func, scal_sse_cpu_func, NULL },
        .opencl_funcs = { scal_cpu_opencl, NULL },
        .cuda_funcs   = { scal_cpu_cuda, NULL },
        .nbuffers     = 1,
        .modes        = { STARPU_RW },
    };

Olivier Aumage STORM Team The StarPU Runtime 3. StarPU Programming Example 52
165 Writing a Kernel Function for CUDA

    static __global__ void vector_mult_cuda(unsigned n,
            float *vector, float factor) {
        unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            vector[i] *= factor;
    }

    extern "C" void scal_cuda_func(void *buffers[], void *cl_arg) {
        struct starpu_vector_interface *vector_handle = buffers[0];
        unsigned n = STARPU_VECTOR_GET_NX(vector_handle);
        float *vector = (float *) STARPU_VECTOR_GET_PTR(vector_handle);
        float *ptr_factor = (float *) cl_arg;

        unsigned threads_per_block = 64;
        unsigned nblocks = (n + threads_per_block - 1) / threads_per_block;

        vector_mult_cuda<<<nblocks, threads_per_block, 0,
            starpu_cuda_get_local_stream()>>>(n, vector, *ptr_factor);
    }

Olivier Aumage STORM Team The StarPU Runtime 3. StarPU Programming Example 53
166 4 StarPU Internals Olivier Aumage STORM Team The StarPU Runtime 54
167 StarPU Internal Structure [layer diagram: HPC applications sit on top of StarPU, which combines a high-level data management library, an execution model, a scheduling engine, and specific drivers for CPUs, GPUs, SPUs, ...] Mastering CPUs, GPUs, SPUs: *PU, hence the name StarPU. Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 55
175 StarPU Internal Functioning [diagram: application, memory management (DSM), scheduling engine, and GPU/CPU drivers, with copies of A and B in RAM and GPU memory] Life of a task A += B: the application submits the task; the scheduling engine schedules it onto a worker (here, the GPU); the memory manager fetches the input data A and B into that worker's memory; the driver offloads the computation; termination is notified. Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 56
177 StarPU Scheduling Policies No one-size-fits-all policy: applications may have very different characteristics and very different scheduling needs. Selectable scheduling policy: a predefined set of popular policies (Eager, Work Stealing, Priority, performance-model-based policies) and an extensible policy set. Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 57
178 The Eager Scheduler First come, first served policy Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 58
191 The Work Stealing Scheduler Load balancing policy Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 59
203 Selecting a Scheduling Policy Use the STARPU_SCHED environment variable; no need to recompile the application.

    Example 1: selecting the prio scheduler
    $ export STARPU_SCHED=prio
    $ my_program

    Example 2: selecting the ws scheduler
    $ export STARPU_SCHED=ws
    $ my_program

    Example 3: resetting to the default scheduler, eager
    $ unset STARPU_SCHED
    $ my_program

Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 60
210 Going Beyond Scheduling is a decision process: providing more input to the scheduler can lead to better scheduling decisions. What kind of information? The relative importance of tasks (priorities), the cost of tasks (codelet performance models), and the cost of transferring data (bus calibration). Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 61
213 The Prio Scheduler Describe the relative importance of tasks: assign priority values to tasks. Tell which tasks matter: tasks that unlock key data pieces, tasks that generate a lot of parallelism. Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 62
214 The Prio Scheduler Describe the relative importance of tasks Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
215 The Prio Scheduler Describe the relative importance of tasks Submit 3 Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
216 The Prio Scheduler Describe the relative importance of tasks 3 Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
217 The Prio Scheduler Describe the relative importance of tasks 3 Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
218 The Prio Scheduler Describe the relative importance of tasks Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
219 The Prio Scheduler Describe the relative importance of tasks Submit 1 Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
220 The Prio Scheduler Describe the relative importance of tasks 1 Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
221 The Prio Scheduler Describe the relative importance of tasks 1 Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
222 The Prio Scheduler Describe the relative importance of tasks Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
223 The Prio Scheduler Describe the relative importance of tasks Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
224 The Prio Scheduler Describe the relative importance of tasks Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
225 The Prio Scheduler Describe the relative importance of tasks Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
226 The Deque Model (dm) Scheduler
Inspired by HEFT (Heterogeneous Earliest Finish Time), a popular scheduling algorithm: try to get the best from both accelerators and CPUs.
Using codelet performance models:
    Kernel calibration on each available computing device
    Raw history model of kernels' past execution times
    Refined models using regression on the kernels' execution-time history
Model parameter: data size by default; user-defined for more complex cases, e.g. sparse data structures (NNZ) or iterative kernels.
229 The Deque Model (dm) Scheduler
Using codelet performance models.
[Animation: for each submitted task, the scheduler consults the per-device performance models to predict the finish time on the CPU cores, GPU 1 and GPU 2, then queues the task on the unit with the earliest predicted finish time.]
239 Data Transfer Modelling
Discrete accelerators: CPU-GPU transfers; data transfer cost vs kernel offload benefit.
Transfer cost modelling:
    Bus calibration (can differ even for identical devices)
    Platform's topology
Data-transfer-aware scheduling:
    Deque Model Data Aware (dmda) policy variants
    Tunable data transfer cost bias: locality vs load balancing
242 Prefetching
Use the DAG to anticipate required data and overlap transfers with computation.
Task states:
    Submitted: task inserted by the application
    Ready: task's dependencies resolved
    Scheduled: task queued on a computing unit
    Executing: task running on a computing unit
Anticipate the Scheduled -> Executing transition: the prefetch is triggered as soon as possible after the task reaches the Scheduled state.
Prefetch may also be triggered by the application.
246 5 Extended Features
247 Extended Features
Platform Support: distributed computing (StarPU-MPI), out-of-core support, Intel MIC / Xeon Phi support, SimGrid support
Programming Support: OpenMP 4.0 compiler (Klang-OMP), OpenCL backend
Scheduling Support: composition on multi- and manycores, scheduling contexts
248 Distributed Computing: StarPU-MPI
Extending StarPU's paradigm to clusters: no global scheduler.
Task-node mapping: provided by the application through the initial data distribution; can be altered dynamically.
[Figure: four nodes (node 0 to node 3) exchanging data through Isend/Irecv pairs.]
249 Communication Requests
Nodes infer the required transfers from the task dependencies; the matching MPI calls (Isend, Irecv) are issued automatically, and tasks wait for their MPI requests.
[Figure: a Cholesky DAG (POTRF, TRSM, SYRK, GEMM) split across node 0 and node 1, with Isend/Irecv on the edges that cross nodes.]
250 Out-of-Core
Enable StarPU to evict temporarily unused data to disk.
Integration with StarPU's general memory-management layer:
    StarPU data handles
    Task dependencies
    Data reloaded automatically
Multiple disk drivers supported:
    Legacy stdio/unistd methods
    Google's LevelDB (key/value database library)
[Animation: data moves between main memory (shared by the CPUs, GPU 0 and GPU 1) and the disk as tasks require it.]
259 Support for Manycore Accelerators
Manycores: Intel Xeon Phi (MIC); also the Intel SCC (Single-chip Cloud Computer).
Technical details:
    Shared support for common characteristics
    Several StarPU instances: one CPU instance (the source) and one instance per manycore accelerator (the sink)
    Scheduling performed by the main CPU StarPU instance
    Separate compilation for CPU code and MIC code
Straightforward port.
260 Klang-omp OpenMP C/C++ Compiler
High-level programming: translate OpenMP directives into runtime-system API calls, targeting the StarPU runtime system or the XKaapi runtime system (Inria team MOAIS).
OpenMP 3.1: virtually full support
OpenMP 4.0: dependent tasks; heterogeneous targets (ongoing work)
LLVM-based source-to-source compiler, built on the open-source Intel compiler clang-omp; available on the K Star project website.
264 Klang-omp Example: Tasks

    int item[N];

    void g(int);

    void f()
    {
        #pragma omp parallel
        {
            #pragma omp single
            {
                int i;
                for (i = 0; i < N; i++)
                    #pragma omp task untied
                    g(item[i]);
            }
        }
    }
265 Klang-omp Example: Dependencies

    void f()
    {
        int a;

        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task shared(a) depend(out: a)
            foo(&a);

            #pragma omp task shared(a) depend(in: a)
            bar(&a);
        }
    }
266 Klang-omp Example: Targets

    #pragma omp declare target
    extern void g(int *A, int *B, int *C);
    #pragma omp end declare target

    int A[N], B[N], C[N];

    void f()
    {
        int a;
        #pragma omp parallel
        #pragma omp master
        #pragma omp target map(to: A, B) map(from: C)
        g(A, B, C);
    }
267 SOCL Layer: StarPU as an OpenCL Backend
High-level programming. SOCL rationale: run generic OpenCL codes on top of StarPU.
Technical details:
    StarPU as an OpenCL backend (ICD: Installable Client Driver)
    Redirects OpenCL calls to StarPU routines
    SOCL can itself use OpenCL kernels
[Figure: generic OpenCL applications sit on the StarPU OpenCL layer; the StarPU drivers target the CPU cores and the OpenCL/GPU devices.]
271 Composition: Scheduling Contexts
Rationale: sharing computing resources among multiple DAGs simultaneously; composing codes and kernels.
Scheduling contexts:
    Map DAGs on subsets of computing units
    Isolate competing kernels or library calls (OpenMP kernels, Intel MKL, etc.)
    Select a scheduling policy per context
272 Contexts: Dynamic Resource Management
[Figure: CPUs and GPUs being reassigned between context 1 and context 2 at run time.]
273 Debugging/Analysis Support
Online tools: visual task debugging with the TEMANEJO GUI (High Performance Computing Center Stuttgart, HLRS).
Offline tools: statistics, performance models, theoretical lower bound; trace-based analysis (Gantt charts, Graphviz DAGs, R plots).
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationReal-Time Rendering Architectures
Real-Time Rendering Architectures Mike Houston, AMD Part 1: throughput processing Three key concepts behind how modern GPU processing cores run code Knowing these concepts will help you: 1. Understand
More informationCOMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers
COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu
More informationIntroduction to GPU programming with CUDA
Introduction to GPU programming with CUDA Dr. Juan C Zuniga University of Saskatchewan, WestGrid UBC Summer School, Vancouver. June 12th, 2018 Outline 1 Overview of GPU computing a. what is a GPU? b. GPU
More informationImproving Performance of Machine Learning Workloads
Improving Performance of Machine Learning Workloads Dong Li Parallel Architecture, System, and Algorithm Lab Electrical Engineering and Computer Science School of Engineering University of California,
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationIntroduction to CELL B.E. and GPU Programming. Agenda
Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationScientific Computing on GPUs: GPU Architecture Overview
Scientific Computing on GPUs: GPU Architecture Overview Dominik Göddeke, Jakub Kurzak, Jan-Philipp Weiß, André Heidekrüger and Tim Schröder PPAM 2011 Tutorial Toruń, Poland, September 11 http://gpgpu.org/ppam11
More informationTrends and Challenges in Multicore Programming
Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores
More information! XKaapi : a runtime for highly parallel (OpenMP) application
XKaapi : a runtime for highly parallel (OpenMP) application Thierry Gautier thierry.gautier@inrialpes.fr MOAIS, INRIA, Grenoble C2S@Exa 10-11/07/2014 Agenda - context - objective - task model and execution
More informationChallenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008
Michael Doggett Graphics Architecture Group April 2, 2008 Graphics Processing Unit Architecture CPUs vsgpus AMD s ATI RADEON 2900 Programming Brook+, CAL, ShaderAnalyzer Architecture Challenges Accelerated
More informationA low memory footprint OpenCL simulation of short-range particle interactions
A low memory footprint OpenCL simulation of short-range particle interactions Raymond Namyst STORM INRIA Group With Samuel Pitoiset, Inria and Emmanuel Cieren, Laurent Colombet, Laurent Soulard, CEA/DAM/DPTA
More informationCS516 Programming Languages and Compilers II
CS516 Programming Languages and Compilers II Zheng Zhang Spring 2015 Jan 22 Overview and GPU Programming I Rutgers University CS516 Course Information Staff Instructor: zheng zhang (eddy.zhengzhang@cs.rutgers.edu)
More informationGPU Programming Using NVIDIA CUDA
GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics
More informationChapter 3 Parallel Software
Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers
More informationCOSC 6339 Accelerators in Big Data
COSC 6339 Accelerators in Big Data Edgar Gabriel Fall 2018 Motivation Programming models such as MapReduce and Spark provide a high-level view of parallelism not easy for all problems, e.g. recursive algorithms,
More informationPreparing seismic codes for GPUs and other
Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of
More informationTOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
More informationDealing with Heterogeneous Multicores
Dealing with Heterogeneous Multicores François Bodin INRIA-UIUC, June 12 th, 2009 Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism
More informationParallel Accelerators
Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon
More informationEE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin
EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel
More informationAccelerating image registration on GPUs
Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining
More informationComputing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany
Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been
More informationPerformance of deal.ii on a node
Performance of deal.ii on a node Bruno Turcksin Texas A&M University, Dept. of Mathematics Bruno Turcksin Deal.II on a node 1/37 Outline 1 Introduction 2 Architecture 3 Paralution 4 Other Libraries 5 Conclusions
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationGPU CUDA Programming
GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications
More informationECE 574 Cluster Computing Lecture 17
ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux
More informationParallel Hybrid Computing Stéphane Bihan, CAPS
Parallel Hybrid Computing Stéphane Bihan, CAPS Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism Various heterogeneous hardware
More informationMulti-core Architectures. Dr. Yingwu Zhu
Multi-core Architectures Dr. Yingwu Zhu What is parallel computing? Using multiple processors in parallel to solve problems more quickly than with a single processor Examples of parallel computing A cluster
More informationIntroduction to GPGPU and GPU-architectures
Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks
More informationParallel Accelerators
Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon
More informationIntroduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware
More informationParallelism. CS6787 Lecture 8 Fall 2017
Parallelism CS6787 Lecture 8 Fall 2017 So far We ve been talking about algorithms We ve been talking about ways to optimize their parameters But we haven t talked about the underlying hardware How does
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna14/ [ 10 ] GPU and CUDA Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and Performance
More informationDesign of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017
Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures
More informationToward a supernodal sparse direct solver over DAG runtimes
Toward a supernodal sparse direct solver over DAG runtimes HOSCAR 2013, Bordeaux X. Lacoste Xavier LACOSTE HiePACS team Inria Bordeaux Sud-Ouest November 27, 2012 Guideline Context and goals About PaStiX
More informationMaster Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.
Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Multithreading Contents Main features of explicit multithreading
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationGeneral introduction: GPUs and the realm of parallel architectures
General introduction: GPUs and the realm of parallel architectures GPU Computing Training August 17-19 th 2015 Jan Lemeire (jan.lemeire@vub.ac.be) Graduated as Engineer in 1994 at VUB Worked for 4 years
More information