The StarPU Runtime System
A Unified Runtime System for Heterogeneous Architectures
Olivier Aumage, STORM Team, Inria, LaBRI
1. Introduction
Hardware Evolution
More capabilities, more complexity:
- Graphics: higher resolutions, 2D acceleration, 3D rendering
- Networking: processing offload, zero-copy transfers, hardware multiplexing
- I/O: RAID, SSD vs. disks
- Computing: multiprocessors, multicores, vector processing extensions, accelerators
Dilemma for the Application Programmer
Stay conservative?
- Only use standards and long-established features: sequential programming, common Unix system calls, TCP sockets
- The risk: under-used hardware, low performance
Use tempting, bleeding-edge features?
- The appeal: efficiency, convenience
- The risks: portability? adaptiveness? cost? long-term viability? vendor-tied code?
The way out: use runtime systems!
The Role(s) of Runtime Systems
- Portability: abstraction; drivers, plugins
- Control: resource mapping, scheduling
- Adaptiveness: load balancing; monitoring, sampling, calibrating
- Optimization: request aggregation, resource locality, computation offload, computation/transfer overlap
Examples of Runtime Systems
- Networking: MPI (Message Passing Interface), Global Arrays, CCI (Common Communication Interface), distributed shared memory systems
- Graphics: DirectX/Direct3D (Microsoft Windows), OpenGL
- I/O: MPI-IO, database engines (Google LevelDB)
- Computing runtime systems?...
Parallel Programming Means
- Languages: directive-based language extensions, specialized languages, PGAS languages...
- Libraries: linear algebra, FFT...
Common Denominator
- Many similar fundamental services: a lower-level layer plus an abstraction/optimization layer form a computing runtime system
- Its job: mapping work onto computing resources, resolving trade-offs, optimizing, scheduling
Computing Runtime Systems
Two major classes: thread scheduling and task scheduling.
Thread Scheduling
- Thread: an unbounded parallel activity, with one state/context per thread
- Variants: cooperative multithreading, preemptive multithreading
- Examples: nowadays, libpthread
- Discussion: flexibility, but what about resource consumption? adaptiveness? synchronization?
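The explicit-threading style discussed above can be sketched as follows (an illustrative example, not from the slides; names and sizes are arbitrary). Note how the programmer manages work splitting and synchronization by hand:

```c
/* Illustrative sketch: explicit thread scheduling with libpthread.
 * Each thread carries its own context; the programmer splits the work
 * and synchronizes manually via pthread_join. */
#include <pthread.h>

#define NTHREADS 4
#define N 1000

static double data[N];
static double partial[NTHREADS];

/* Each worker sums its own slice of the array. */
static void *worker(void *arg) {
    long id = (long)arg;
    long begin = id * (N / NTHREADS);
    long end = (id + 1) * (N / NTHREADS);
    double s = 0.0;
    for (long i = begin; i < end; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

double parallel_sum(void) {
    pthread_t th[NTHREADS];
    for (long i = 0; i < N; i++)
        data[i] = 1.0;
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&th[id], NULL, worker, (void *)id);
    double total = 0.0;
    for (long id = 0; id < NTHREADS; id++) {
        pthread_join(th[id], NULL);   /* explicit synchronization point */
        total += partial[id];
    }
    return total;
}
```

This hand-managed splitting and joining is precisely what task-based runtimes abstract away.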
Task Scheduling
- Task: an elementary computation, a potential unit of parallel work
- No dedicated state; tasks are executed by an internal set of worker threads
- Variants: single tasks vs. multiple tasks, recursive tasks vs. non-blocking tasks, dependency management
- Examples: StarPU, GLINDA, Cilk's runtime, Intel Threading Building Blocks (TBB), StarSs/OmpSs, PaRSEC...
- Discussion: abstraction, adaptiveness, transparent synchronization using dependencies
Evolution of Computing Hardware
- Rupture: the Frequency Wall; processing units cannot run any faster, so other sources of performance must be found
- Hardware parallelism: multiply existing processing power by having several processing units work together
- Not a new idea, but now becoming the key performance factor
Processor Parallelisms
Various forms of hardware parallelism:
- Multiprocessors
- Multicores
- Hardware multithreading (SMT)
- Vector processing (SIMD)
Multiple forms may be combined.
Multiprocessors and Multicores
- Multiprocessors: full processor replicates. Rationale: share the node contents, i.e. memory and devices; memory sharing may involve non-uniformity (NUMA)
- Multicores: processor circuit replicates (cores) printed on the same die. Rationale: use available die area (freed by process shrinking) for more processing power; share memory and devices, and possibly some additional die circuitry (cache(s), uncore services)
Taking advantage of them?
- Needs multiple parallel application activities
- Additional considerations: availability, work mapping issues, locality issues, memory bandwidth issues
Hardware Multithreading
- Simultaneous Multithreading (SMT): multiple processing contexts managed by the same core, enabling several threads to be interleaved on that core
- Rationale: try to fill more computing units (e.g. int + float), hide memory/cache latency
Taking advantage of it?
- Needs multiple parallel application activities
- Highly dependent on the characteristics of the application's activities: complementary vs. competitive
- Additional considerations: availability, work mapping issues, locality issues, memory bandwidth issues, benefit vs. loss
Vector Processing
- Single Instruction, Multiple Data (SIMD): apply an instruction to multiple data elements simultaneously, enabling simple operations to be repeated over array elements
- Rationale: share instruction decoding between several data elements
Taking advantage of it?
- Specially written kernels: compiler auto-vectorization, assembly language, or intrinsics
- Additional considerations: availability, feature sets/variants (MMX, 3DNow!, SSE [2...5], AVX...), benefit vs. loss
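The two kernel-writing routes above can be sketched on a vector addition (an illustrative example, not from the slides; it assumes an x86 CPU with SSE2 and, for the intrinsic version, a length that is a multiple of 4):

```c
/* Illustrative sketch: the same vector addition written as a plain loop
 * that a compiler can auto-vectorize, and with explicit SSE intrinsics. */
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Scalar form: compilers typically vectorize this at -O2/-O3. */
void vecadd_scalar(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Explicit SSE form: 4 floats per instruction; n must be a multiple of 4. */
void vecadd_sse(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);  /* load 4 floats (unaligned) */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));  /* 4 adds at once */
    }
}
```

The intrinsic version illustrates the "benefit vs. loss" point: it is faster on matching hardware but tied to one SIMD feature set, while the scalar loop stays portable.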
Accelerators
- Special-purpose computing devices (or general-purpose GPUs); (initially) a discrete expansion card
- Rationale: die area trade-off
- Single Instruction, Multiple Threads (SIMT): a single control unit drives several computing units; on a GPU, scalar cores (streaming processors) are grouped under one control unit in a streaming multiprocessor, backed by on-board DRAM
- SIMT is distinct from SIMD: control flows are allowed to diverge (e.g. across an if/else)... but divergence is better avoided!
GPU Hardware Model: CPU vs GPU
Multiple strategies for multiple purposes:
- CPU strategy: large caches, large control logic. Purpose: complex, branching codes with complex memory access patterns. Analogy: a World Rally Championship car.
- GPU strategy: lots of computing power (many ALUs), simplified control logic. Purpose: regular data-parallel codes with simple memory access patterns. Analogy: a Formula One car.
GPU Software Model (SIMT)
- Kernels are enclosed in an implicit loop over an iteration space; one kernel instance is run for each point of the space
- Threads execute the work simultaneously
- Specific languages: NVIDIA CUDA, OpenCL
- Hardware abstraction: each scalar core executes kernel instances; the thread executing a given instance is identified by the threadIdx variable, so each instance effectively runs with its own i = threadIdx.x (i = 0, i = 1, i = 2, ...)

CUDA vector-addition kernel from the slide (cleaned up); the commented-out launch is what a real CUDA program uses, while the loop below it shows the conceptual, sequential equivalent:

    __global__ void
    vecadd(float *A, float *B, float *C) {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int
    main() {
        ...
        // vecadd<<<1, NB>>>(A, B, C);
        for (threadIdx.x = 0; threadIdx.x < NB; threadIdx.x++) {
            vecadd(A, B, C);
        }
        ...
    }
Manycores
- Intel SCC: 48 cores (P54C Pentium), no cache coherence, communication library
- Intel Xeon Phi/MIC: 61 cores (P54C Pentium), 4 hardware threads per core, AVX 512-bit SIMD instruction set, cache coherence
- Classical programming tool-chain (compilers, libraries)... but no free lunch: kernels and applications still need optimization work
- Discrete accelerator cards (for now!): transfer data to card memory, transfer results back to main memory
Heterogeneous Parallel Platforms
- Heterogeneous association: general-purpose processor + specialized accelerator
- Generalization: combinations of various units: latency-optimized cores, throughput-optimized cores, energy-optimized cores
- Distributed cores: standalone GPUs, Intel Xeon Phi (MIC), Intel Single-Chip Cloud (SCC)
- Integrated cores: Intel Haswell, AMD Fusion, NVIDIA Tegra
Programming Models for Heterogeneous Platforms?
How to program these architectures?
- Multicore programming: pthreads, OpenMP, TBB, Cilk?, MPI...
- Accelerator programming: consensus on OpenCL? (often) a pure offloading model; also CUDA, libspe, ATI Stream
- Hybrid models? Take advantage of all resources, but at the cost of complex interactions
Heterogeneous Task Scheduling
Scheduling on a platform equipped with accelerators:
- Adapting to heterogeneity: decide which tasks to offload and which tasks to keep on the CPU
- Communicating with discrete accelerator board(s): send computation requests, send the data to be processed, fetch results back; this is expensive
- Deciding whether offloading is worthwhile at all
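The "worthiness" decision above can be sketched with a naive cost model (an illustrative toy, not StarPU's scheduler; all parameter names are hypothetical): offloading pays off only when the accelerator's compute time plus the round-trip transfer cost beats the CPU's compute time.

```c
/* Illustrative sketch: a naive offload-worthiness test. */

/* Time to move `bytes` over a link with the given latency (seconds)
 * and bandwidth (bytes per second). */
double transfer_time(double bytes, double latency, double bandwidth) {
    return latency + bytes / bandwidth;
}

/* Returns 1 if running the task on the accelerator is predicted to be
 * faster than keeping it on the CPU, accounting for both the input
 * transfer and fetching the results back. */
int worth_offloading(double cpu_time, double gpu_time,
                     double bytes_in, double bytes_out,
                     double latency, double bandwidth) {
    double total_gpu = gpu_time
                     + transfer_time(bytes_in, latency, bandwidth)
                     + transfer_time(bytes_out, latency, bandwidth);
    return total_gpu < cpu_time;
}
```

For example, a task that takes 1 s on the CPU but 0.1 s on the GPU is worth offloading across a 1 GB/s link for 1 MB of data each way; the same task taking only 0.05 s on the CPU is not.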
2. StarPU Programming/Execution Models
Task Parallelism Principles
- A task = an «elementary» computation + dependencies
- Input dependencies (e.g. A, B), a computation kernel (e.g. A = A+B), output dependencies (e.g. A)
Task Relationships: Abstract Application Structure
- Tasks (elementary computations + dependencies) are linked through their input and output dependencies
- The resulting structure is a Directed Acyclic Graph (DAG) of tasks
StarPU Programming Model: Sequential Task Submission
- Express parallelism using the natural program flow
- Submit tasks in the sequential flow of the program, and let the runtime schedule the tasks asynchronously
Ex.: Sequential Cholesky Decomposition

    for (j = 0; j < N; j++) {
        POTRF(A[j][j]);
        for (i = j+1; i < N; i++)
            TRSM(A[i][j], A[j][j]);
        for (i = j+1; i < N; i++) {
            SYRK(A[i][i], A[i][j]);
            for (k = j+1; k < i; k++)
                GEMM(A[i][k], A[i][j], A[k][j]);
        }
    }
Ex.: Task-Based Cholesky Decomposition

    for (j = 0; j < N; j++) {
        POTRF(RW, A[j][j]);
        for (i = j+1; i < N; i++)
            TRSM(RW, A[i][j], R, A[j][j]);
        for (i = j+1; i < N; i++) {
            SYRK(RW, A[i][i], R, A[i][j]);
            for (k = j+1; k < i; k++)
                GEMM(RW, A[i][k], R, A[i][j], R, A[k][j]);
        }
    }
    wait();
Dynamic Task Graph Building
Considering the task-based Cholesky loop, with its RW/R-annotated data references and final wait():
- Tasks are submitted asynchronously at run-time
- Data references are annotated with their access modes
- StarPU infers data dependences and builds a graph of POTRF, TRSM, SYRK and GEMM tasks
- The graph of tasks is then executed
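The dependence-inference step can be sketched as a toy in C. This is an illustrative simplification, not StarPU's implementation: here every access simply depends on the datum's last writer, whereas a real runtime also tracks reader sets to order write-after-read correctly and to let concurrent readers proceed in parallel.

```c
/* Illustrative sketch: inferring task dependencies from annotated data
 * accesses, as a sequential-task-flow runtime can do at submission time. */

enum access_mode { R, RW };

#define MAX_DATA 64
#define MAX_DEPS 256

/* last_writer[d] = id of the last task that wrote datum d (-1 if none). */
static int last_writer[MAX_DATA];
static int dep_from[MAX_DEPS], dep_to[MAX_DEPS];
static int ndeps;

void task_graph_init(void) {
    for (int d = 0; d < MAX_DATA; d++)
        last_writer[d] = -1;
    ndeps = 0;
}

/* Record one annotated access of `task` to `datum`: add a dependency
 * edge from the datum's last writer, and on RW access make this task
 * the new last writer. */
void submit_access(int task, int datum, enum access_mode mode) {
    if (last_writer[datum] >= 0) {
        dep_from[ndeps] = last_writer[datum];
        dep_to[ndeps] = task;
        ndeps++;
    }
    if (mode == RW)
        last_writer[datum] = task;
}

int dependency_count(void) { return ndeps; }
```

Replaying the annotated Cholesky loop through such a routine is what turns the sequential submission order into a DAG.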
StarPU Execution Model: Task Scheduling
- Mapping the graph of tasks (DAG) onto the hardware: allocating computing resources, enforcing dependency constraints, handling data transfers
- Adaptiveness: a single DAG enables multiple schedulings, and a single DAG can be mapped onto multiple platforms (e.g. a timeline of the same DAG spread over CPUs and a GPU)
A Single DAG for Multiple Schedules, Platforms
The same DAG can be mapped onto a multicore CPU platform or onto mixed CPU/multi-GPU platforms.
What StarPU does for You: Heterogeneous Task Scheduling
- Heterogeneous task mapping using a selectable scheduling algorithm
- Submitted tasks are dispatched among the CPU cores and the GPUs (e.g. GPU 1, GPU 2)
120 What StarPU does for You: Data Transfers [diagram: main memory (RAM) and the GPU0/GPU1 memories exchanging tiles as the POTRF/TRSM/SYRK/GEMM graph executes] Handles dependencies. Handles asynchronous data transfers. Handles data replication. Handles data consistency (MSI protocol). Olivier Aumage STORM Team The StarPU Runtime 2. StarPU Programming/Execution Models 36
121 Showcase with the MAGMA Linear Algebra Library UTK, INRIA HIEPACS, INRIA RUNTIME. QR decomposition on 16 CPUs (AMD) + 4 GPUs (C1060). [plot: Gflop/s vs. matrix order, for 1 to 4 GPUs each paired with CPUs, up to 4 GPUs + 16 CPUs] Expected increase from adding 12 CPUs: ~150 Gflop/s; measured increase: ~200 Gflop/s. Olivier Aumage STORM Team The StarPU Runtime 2. StarPU Programming/Execution Models 37
122 Showcase with the MAGMA Linear Algebra Library QR kernel properties: SGEQRT CPU 9 GFlop/s, GPU 30 GFlop/s, speed-up 3; STSQRT CPU 12 GFlop/s, GPU 37 GFlop/s, speed-up 3; SOMQRT CPU 8.5 GFlop/s, GPU 227 GFlop/s, speed-up 27; SSSMQ CPU 10 GFlop/s, GPU 285 GFlop/s, speed-up 28. Consequences on task distribution: SGEQRT 20% of tasks on GPU, SSSMQ 92% of tasks on GPU. Taking advantage of heterogeneity: each unit only runs the tasks it is good at, and avoids those it is not. Olivier Aumage STORM Team The StarPU Runtime 2. StarPU Programming/Execution Models 38
123 StarPU in a Nutshell Rationale: let applications submit tasks in the natural, sequential flow of the program, and map computations on heterogeneous computing units. Shipped as an LGPL library: an application programming interface, and a target for compilers, libraries and high-level layers. Programming model: tasks and the data relationships between them. Execution model: heterogeneous task scheduling, automatic data transfers. Olivier Aumage STORM Team The StarPU Runtime 2. StarPU Programming/Execution Models 39
124 3 StarPU Programming Example Olivier Aumage STORM Team The StarPU Runtime 40
130 Basic Example: Scaling a Vector

    float factor = /* ... */;
    float vector[NX];
    starpu_data_handle_t vector_handle;

    /* ... fill vector ... */

    starpu_vector_data_register(&vector_handle, 0,
        (uintptr_t) vector, NX, sizeof(vector[0]));

    starpu_task_insert(
        &scal_cl,
        STARPU_RW, vector_handle,
        STARPU_VALUE, &factor, sizeof(factor),
        0);

    starpu_task_wait_for_all();
    starpu_data_unregister(vector_handle);

    /* ... display vector ... */

Olivier Aumage STORM Team The StarPU Runtime 3. StarPU Programming Example 46
134 Declaring a Codelet to StarPU Define a struct starpu_codelet. Plug the kernel function (here: scal_cpu_func). Declare the number of data pieces used by the kernel (here: a single vector). Declare how the kernel accesses each piece of data (here: the vector is scaled in place, thus R/W).

    struct starpu_codelet scal_cl = {
        .cpu_funcs = { scal_cpu_func, NULL },
        .nbuffers  = 1,
        .modes     = { STARPU_RW },
    };

Olivier Aumage STORM Team The StarPU Runtime 3. StarPU Programming Example 47
139 Declaring and Managing Data Put data under StarPU control. Initialize a piece of data. Register the piece of data and get a handle: the vector is now under StarPU control. Use the data through the handle. Unregister the piece of data: the handle is destroyed and the vector is back under user control.

    float vector[NX];
    /* ... fill data ... */

    starpu_data_handle_t vector_handle;
    starpu_vector_data_register(&vector_handle, 0,
        (uintptr_t) vector, NX, sizeof(vector[0]));

    /* ... use the vector through the handle ... */

    starpu_data_unregister(vector_handle);

Olivier Aumage STORM Team The StarPU Runtime 3. StarPU Programming Example 48
144 Writing a Kernel Function Every kernel function has the same C prototype. Retrieve the vector's data interface. Get the vector's number of elements and base pointer. Get the scaling factor as inline argument. Compute the vector scaling.

    void scal_cpu_func(void *buffers[], void *cl_arg) {
        struct starpu_vector_interface *vector_handle = buffers[0];

        unsigned n = STARPU_VECTOR_GET_NX(vector_handle);
        float *vector = (float *) STARPU_VECTOR_GET_PTR(vector_handle);

        float *ptr_factor = cl_arg;

        unsigned i;
        for (i = 0; i < n; i++)
            vector[i] *= *ptr_factor;
    }

Olivier Aumage STORM Team The StarPU Runtime 3. StarPU Programming Example 49
152 Submitting a task The starpu_task_insert call inserts a task in the StarPU DAG. Arguments: the codelet structure, the StarPU-managed data, the small-size inline data, and 0 to mark the end of arguments.

    starpu_task_insert(&scal_cl,
        STARPU_RW, vector_handle,
        STARPU_VALUE, &factor, sizeof(factor),
        0);

Notes: the task is submitted non-blockingly; dependencies on previously submitted tasks are enforced through its data, following the natural order of the program. Olivier Aumage STORM Team The StarPU Runtime 3. StarPU Programming Example 50
156 Waiting for Submitted Task Completion Tasks are submitted non-blockingly. Wait for all submitted tasks to complete their work:

    /* non-blocking task submits */
    starpu_task_insert(...);
    starpu_task_insert(...);
    starpu_task_insert(...);
    ...

    /* wait for all tasks submitted so far */
    starpu_task_wait_for_all();

Olivier Aumage STORM Team The StarPU Runtime 3. StarPU Programming Example 51
159 Heterogeneity: Declaring Device-Specific Kernels Extending a codelet to handle heterogeneous platforms. Multiple kernel implementations for a CPU: SSE, AVX, ... optimized kernels. Kernel implementations for accelerator devices: OpenCL, NVIDIA CUDA kernels.

    struct starpu_codelet scal_cl = {
        .cpu_funcs    = { scal_cpu_func, scal_sse_cpu_func, NULL },
        .opencl_funcs = { scal_cpu_opencl, NULL },
        .cuda_funcs   = { scal_cpu_cuda, NULL },
        .nbuffers     = 1,
        .modes        = { STARPU_RW },
    };

Olivier Aumage STORM Team The StarPU Runtime 3. StarPU Programming Example 52
165 Writing a Kernel Function for CUDA

    static __global__ void vector_mult_cuda(unsigned n,
            float *vector, float factor) {
        unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            vector[i] *= factor;
    }

    extern "C" void scal_cuda_func(void *buffers[], void *cl_arg) {
        struct starpu_vector_interface *vector_handle = buffers[0];
        unsigned n = STARPU_VECTOR_GET_NX(vector_handle);
        float *vector = (float *) STARPU_VECTOR_GET_PTR(vector_handle);
        float *ptr_factor = (float *) cl_arg;

        unsigned threads_per_block = 64;
        unsigned nblocks = (n + threads_per_block - 1) / threads_per_block;

        vector_mult_cuda<<<nblocks, threads_per_block, 0,
            starpu_cuda_get_local_stream()>>>(n, vector, *ptr_factor);
    }

Olivier Aumage STORM Team The StarPU Runtime 3. StarPU Programming Example 53
166 4 StarPU Internals Olivier Aumage STORM Team The StarPU Runtime 54
167 StarPU Internal Structure [layer diagram: HPC applications sit on top of StarPU, which combines a high-level data management library, an execution model, a scheduling engine, and specific drivers for CPUs, GPUs, SPUs, ...] Mastering CPUs, GPUs, SPUs: *PU, hence the name StarPU. Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 55
175 StarPU Internal Functioning [diagram: application, memory management (DSM), scheduling engine, and GPU/CPU drivers, with copies of A and B in RAM and GPU memory] Life of a task A += B: the application submits the task; the scheduling engine schedules it onto a worker (here, the GPU); the memory manager fetches the input data A and B into that worker's memory; the driver offloads the computation; termination is notified. Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 56
177 StarPU Scheduling Policies No one-size-fits-all policy: applications may have very different characteristics and very different scheduling needs. Selectable scheduling policy: a predefined set of popular policies (Eager, Work Stealing, Priority, performance-model-based policies) and an extensible policy set. Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 57
178 The Eager Scheduler First come, first served policy Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 58
191 The Work Stealing Scheduler Load balancing policy Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 59
203 Selecting a Scheduling Policy Use the STARPU_SCHED environment variable; no need to recompile the application.

    Example 1: selecting the prio scheduler
    $ export STARPU_SCHED=prio
    $ my_program

    Example 2: selecting the ws scheduler
    $ export STARPU_SCHED=ws
    $ my_program

    Example 3: resetting to the default scheduler, eager
    $ unset STARPU_SCHED
    $ my_program

Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 60
210 Going Beyond Scheduling is a decision process: providing more input to the scheduler can lead to better scheduling decisions. What kind of information? The relative importance of tasks (priorities), the cost of tasks (codelet performance models), and the cost of transferring data (bus calibration). Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 61
213 The Prio Scheduler Describe the relative importance of tasks: assign priority values to tasks. Tell which tasks matter: tasks that unlock key data pieces, tasks that generate a lot of parallelism. Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 62
214 The Prio Scheduler Describe the relative importance of tasks Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
215 The Prio Scheduler Describe the relative importance of tasks Submit 3 Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
216 The Prio Scheduler Describe the relative importance of tasks 3 Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
217 The Prio Scheduler Describe the relative importance of tasks 3 Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
218 The Prio Scheduler Describe the relative importance of tasks Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
219 The Prio Scheduler Describe the relative importance of tasks Submit 1 Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
220 The Prio Scheduler Describe the relative importance of tasks 1 Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
221 The Prio Scheduler Describe the relative importance of tasks 1 Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
222 The Prio Scheduler Describe the relative importance of tasks Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
223 The Prio Scheduler Describe the relative importance of tasks Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
224 The Prio Scheduler Describe the relative importance of tasks Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
225 The Prio Scheduler Describe the relative importance of tasks Prio 3 Prio 2 Prio 1 CPU Cores GPU 1 GPU 2 Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63
226 The Deque Model (dm) Scheduler
Inspired by HEFT (Heterogeneous Earliest Finish Time), a popular scheduling algorithm: try to get the best from both accelerators and CPUs.
Using codelet performance models:
    Kernel calibration on each available computing device
    Raw history model of kernels' past execution times
    Refined models using regression on the kernels' execution-time history
Model parameter: data size by default; user-defined for more complex cases, e.g. sparse data structures (NNZ) or iterative kernels.
229 The Deque Model (dm) Scheduler
Using codelet performance models.
[Animation: for each submitted task, the scheduler consults the per-device performance models to predict the finish time on the CPU cores, GPU 1 and GPU 2, then queues the task on the unit with the earliest predicted finish time.]
239 Data Transfer Modelling
Discrete accelerators: CPU-GPU transfers; data transfer cost vs kernel offload benefit.
Transfer cost modelling:
    Bus calibration (can differ even for identical devices)
    Platform's topology
Data-transfer-aware scheduling:
    Deque Model Data Aware (dmda) policy variants
    Tunable data transfer cost bias: locality vs load balancing
242 Prefetching
Use the DAG to anticipate required data and overlap transfers with computation.
Task states:
    Submitted: task inserted by the application
    Ready: task's dependencies resolved
    Scheduled: task queued on a computing unit
    Executing: task running on a computing unit
Anticipate the Scheduled -> Executing transition: the prefetch is triggered as soon as possible after the task reaches the Scheduled state.
Prefetch may also be triggered by the application.
246 5 Extended Features
247 Extended Features
Platform Support: distributed computing (StarPU-MPI), out-of-core support, Intel MIC / Xeon Phi support, SimGrid support
Programming Support: OpenMP 4.0 compiler (Klang-OMP), OpenCL backend
Scheduling Support: composition on multi- and manycores, scheduling contexts
248 Distributed Computing: StarPU-MPI
Extending StarPU's paradigm to clusters: no global scheduler.
Task-node mapping: provided by the application through the initial data distribution; can be altered dynamically.
[Figure: four nodes (node 0 to node 3) exchanging data through Isend/Irecv pairs.]
249 Communication Requests
Nodes infer the required transfers from the task dependencies; the matching MPI calls (Isend, Irecv) are issued automatically, and tasks wait for their MPI requests.
[Figure: a Cholesky DAG (POTRF, TRSM, SYRK, GEMM) split across node 0 and node 1, with Isend/Irecv on the edges that cross nodes.]
250 Out-of-Core
Enable StarPU to evict temporarily unused data to disk.
Integration with StarPU's general memory-management layer:
    StarPU data handles
    Task dependencies
    Data reloaded automatically
Multiple disk drivers supported:
    Legacy stdio/unistd methods
    Google's LevelDB (key/value database library)
[Animation: data moves between main memory (shared by the CPUs, GPU 0 and GPU 1) and the disk as tasks require it.]
259 Support for Manycore Accelerators
Manycores: Intel Xeon Phi (MIC); also the Intel SCC (Single-chip Cloud Computer).
Technical details:
    Shared support for common characteristics
    Several StarPU instances: one CPU instance (the source) and one instance per manycore accelerator (the sink)
    Scheduling performed by the main CPU StarPU instance
    Separate compilation for CPU code and MIC code
Straightforward port.
260 Klang-omp OpenMP C/C++ Compiler
High-level programming: translate OpenMP directives into runtime-system API calls, targeting the StarPU runtime system or the XKaapi runtime system (Inria team MOAIS).
OpenMP 3.1: virtually full support
OpenMP 4.0: dependent tasks; heterogeneous targets (ongoing work)
LLVM-based source-to-source compiler, built on the open-source Intel compiler clang-omp; available on the K Star project website.
264 Klang-omp Example: Tasks

    int item[N];

    void g(int);

    void f()
    {
        #pragma omp parallel
        {
            #pragma omp single
            {
                int i;
                for (i = 0; i < N; i++)
                    #pragma omp task untied
                    g(item[i]);
            }
        }
    }
265 Klang-omp Example: Dependencies

    void f()
    {
        int a;

        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task shared(a) depend(out: a)
            foo(&a);

            #pragma omp task shared(a) depend(in: a)
            bar(&a);
        }
    }
266 Klang-omp Example: Targets

    #pragma omp declare target
    extern void g(int *A, int *B, int *C);
    #pragma omp end declare target

    int A[N], B[N], C[N];

    void f()
    {
        int a;
        #pragma omp parallel
        #pragma omp master
        #pragma omp target map(to: A, B) map(from: C)
        g(A, B, C);
    }
267 SOCL Layer: StarPU as an OpenCL Backend
High-level programming. SOCL rationale: run generic OpenCL codes on top of StarPU.
Technical details:
    StarPU as an OpenCL backend (ICD: Installable Client Driver)
    Redirects OpenCL calls to StarPU routines
    SOCL can itself use OpenCL kernels
[Figure: generic OpenCL applications sit on the StarPU OpenCL layer; the StarPU drivers target the CPU cores and the OpenCL/GPU devices.]
271 Composition: Scheduling Contexts
Rationale: sharing computing resources among multiple DAGs simultaneously; composing codes and kernels.
Scheduling contexts:
    Map DAGs on subsets of computing units
    Isolate competing kernels or library calls (OpenMP kernels, Intel MKL, etc.)
    Select a scheduling policy per context
272 Contexts: Dynamic Resource Management
[Figure: CPUs and GPUs being reassigned between context 1 and context 2 at run time.]
273 Debugging/Analysis Support
Online tools: visual task debugging with the TEMANEJO GUI (High Performance Computing Center Stuttgart, HLRS).
Offline tools: statistics, performance models, theoretical lower bound; trace-based analysis (Gantt charts, Graphviz DAGs, R plots).
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationReal-Time Rendering Architectures
Real-Time Rendering Architectures Mike Houston, AMD Part 1: throughput processing Three key concepts behind how modern GPU processing cores run code Knowing these concepts will help you: 1. Understand
More informationCOMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers
COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu
More informationIntroduction to GPU programming with CUDA
Introduction to GPU programming with CUDA Dr. Juan C Zuniga University of Saskatchewan, WestGrid UBC Summer School, Vancouver. June 12th, 2018 Outline 1 Overview of GPU computing a. what is a GPU? b. GPU
More informationImproving Performance of Machine Learning Workloads
Improving Performance of Machine Learning Workloads Dong Li Parallel Architecture, System, and Algorithm Lab Electrical Engineering and Computer Science School of Engineering University of California,
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationIntroduction to CELL B.E. and GPU Programming. Agenda
Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationScientific Computing on GPUs: GPU Architecture Overview
Scientific Computing on GPUs: GPU Architecture Overview Dominik Göddeke, Jakub Kurzak, Jan-Philipp Weiß, André Heidekrüger and Tim Schröder PPAM 2011 Tutorial Toruń, Poland, September 11 http://gpgpu.org/ppam11
More informationTrends and Challenges in Multicore Programming
Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores
More information! XKaapi : a runtime for highly parallel (OpenMP) application
XKaapi : a runtime for highly parallel (OpenMP) application Thierry Gautier thierry.gautier@inrialpes.fr MOAIS, INRIA, Grenoble C2S@Exa 10-11/07/2014 Agenda - context - objective - task model and execution
More informationChallenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008
Michael Doggett Graphics Architecture Group April 2, 2008 Graphics Processing Unit Architecture CPUs vsgpus AMD s ATI RADEON 2900 Programming Brook+, CAL, ShaderAnalyzer Architecture Challenges Accelerated
More informationA low memory footprint OpenCL simulation of short-range particle interactions
A low memory footprint OpenCL simulation of short-range particle interactions Raymond Namyst STORM INRIA Group With Samuel Pitoiset, Inria and Emmanuel Cieren, Laurent Colombet, Laurent Soulard, CEA/DAM/DPTA
More informationCS516 Programming Languages and Compilers II
CS516 Programming Languages and Compilers II Zheng Zhang Spring 2015 Jan 22 Overview and GPU Programming I Rutgers University CS516 Course Information Staff Instructor: zheng zhang (eddy.zhengzhang@cs.rutgers.edu)
More informationGPU Programming Using NVIDIA CUDA
GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics
More informationChapter 3 Parallel Software
Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers
More informationCOSC 6339 Accelerators in Big Data
COSC 6339 Accelerators in Big Data Edgar Gabriel Fall 2018 Motivation Programming models such as MapReduce and Spark provide a high-level view of parallelism not easy for all problems, e.g. recursive algorithms,
More informationPreparing seismic codes for GPUs and other
Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of
More informationTOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
More informationDealing with Heterogeneous Multicores
Dealing with Heterogeneous Multicores François Bodin INRIA-UIUC, June 12 th, 2009 Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism
More informationParallel Accelerators
Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon
More informationEE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin
EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel
More informationAccelerating image registration on GPUs
Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining
More informationComputing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany
Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been
More informationPerformance of deal.ii on a node
Performance of deal.ii on a node Bruno Turcksin Texas A&M University, Dept. of Mathematics Bruno Turcksin Deal.II on a node 1/37 Outline 1 Introduction 2 Architecture 3 Paralution 4 Other Libraries 5 Conclusions
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationGPU CUDA Programming
GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications
More informationECE 574 Cluster Computing Lecture 17
ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux
More informationParallel Hybrid Computing Stéphane Bihan, CAPS
Parallel Hybrid Computing Stéphane Bihan, CAPS Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism Various heterogeneous hardware
More informationMulti-core Architectures. Dr. Yingwu Zhu
Multi-core Architectures Dr. Yingwu Zhu What is parallel computing? Using multiple processors in parallel to solve problems more quickly than with a single processor Examples of parallel computing A cluster
More informationIntroduction to GPGPU and GPU-architectures
Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks
More informationParallel Accelerators
Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon
More informationIntroduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware
More informationParallelism. CS6787 Lecture 8 Fall 2017
Parallelism CS6787 Lecture 8 Fall 2017 So far We ve been talking about algorithms We ve been talking about ways to optimize their parameters But we haven t talked about the underlying hardware How does
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna14/ [ 10 ] GPU and CUDA Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and Performance
More informationDesign of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017
Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures
More informationToward a supernodal sparse direct solver over DAG runtimes
Toward a supernodal sparse direct solver over DAG runtimes HOSCAR 2013, Bordeaux X. Lacoste Xavier LACOSTE HiePACS team Inria Bordeaux Sud-Ouest November 27, 2012 Guideline Context and goals About PaStiX
More informationMaster Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.
Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Multithreading Contents Main features of explicit multithreading
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationGeneral introduction: GPUs and the realm of parallel architectures
General introduction: GPUs and the realm of parallel architectures GPU Computing Training August 17-19 th 2015 Jan Lemeire (jan.lemeire@vub.ac.be) Graduated as Engineer in 1994 at VUB Worked for 4 years
More information