The StarPU Runtime System


The StarPU Runtime System: A Unified Runtime System for Heterogeneous Architectures
Olivier Aumage, STORM Team, Inria / LaBRI

1. Introduction

Hardware Evolution: More Capabilities, More Complexity
- Graphics: higher resolutions, 2D acceleration, 3D rendering
- Networking: processing offload, zero-copy transfers, hardware multiplexing
- I/O: RAID, SSDs vs. disks
- Computing: multiprocessors, multicores, vector processing extensions, accelerators

Dilemma for the Application Programmer
- Stay conservative? Only use standards and long-established features: sequential programming, common Unix system calls, TCP sockets. The risk: under-used hardware and low performance.
- Use tempting, bleeding-edge features? Efficiency and convenience, but what about portability, adaptiveness, cost, long-term viability, and vendor-tied code?
- The answer: use runtime systems!

The Role(s) of Runtime Systems
- Portability: abstraction; drivers, plugins
- Control: resource mapping, scheduling
- Adaptiveness: load balancing; monitoring, sampling, calibrating
- Optimization: request aggregation, resource locality, computation offload, computation/transfer overlap

Examples of Runtime Systems
- Networking: MPI (Message Passing Interface), Global Arrays, CCI (Common Communication Interface), distributed shared memory systems
- Graphics: DirectX/Direct3D (Microsoft Windows), OpenGL
- I/O: MPI-IO, database engines (Google LevelDB)
- Computing runtime systems?...

Parallel Programming Means
- Languages: directive-based language extensions, specialized languages, PGAS languages, ...
- Libraries: linear algebra, FFT, ...

Common Denominator
- Many similar fundamental services: a lower-level layer plus an abstraction/optimization layer, i.e. a computing runtime system
- Mapping work onto computing resources: resolving trade-offs, optimizing, scheduling

Computing Runtime Systems
- Two major classes: thread scheduling and task scheduling

Thread Scheduling
- Thread: an unbounded parallel activity, with one state/context per thread
- Variants: cooperative multithreading, preemptive multithreading
- Examples: nowadays, libpthread (see the sketch below)
- Discussion: flexibility, but what about resource consumption, adaptiveness, synchronization?
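As a minimal illustration of the thread model (a sketch of plain POSIX threads usage, not part of the original slides), each thread carries its own context, and synchronization is explicit:

    #include <pthread.h>
    #include <stdio.h>

    /* Each thread is an unbounded activity with its own stack/context. */
    static void *worker(void *arg) {
        int id = *(int *)arg;
        printf("thread %d running\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t threads[4];
        int ids[4];
        for (int i = 0; i < 4; i++) {
            ids[i] = i;
            pthread_create(&threads[i], NULL, worker, &ids[i]);
        }
        /* Synchronization is explicit: wait for each thread to finish. */
        for (int i = 0; i < 4; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }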

Task Scheduling
- Task: an elementary computation, a potential piece of parallel work
- No dedicated state; tasks are executed by an internal set of worker threads
- Variants: single tasks vs. multiple tasks, recursive tasks vs. non-blocking tasks, dependency management
- Examples: StarPU, GLINDA, Cilk's runtime, Intel Threading Building Blocks (TBB), StarSs/OmpSs, PaRSEC, ...
- Discussion: abstraction, adaptiveness, transparent synchronization using dependencies

Evolution of Computing Hardware
- Rupture: the frequency wall; processing units cannot run any faster, so other sources of performance are needed
- Hardware parallelism: multiply the existing processing power by having several processing units work together
- Not a new idea, but now becoming the key performance factor

Processor Parallelism
- Various forms of hardware parallelism: multiprocessors, multicores, hardware multithreading (SMT), vector processing (SIMD)
- Multiple forms may be combined

Multiprocessors and Multicores
- Multiprocessors: full processor replicates. Rationale: share node contents, i.e. memory and devices. Memory sharing may involve non-uniformity (NUMA).
- Multicores: processor circuit replicates (cores) printed on the same die. Rationale: use the die area freed by process shrinking for more processing power. Cores share memory and devices, and may share additional on-die circuitry (cache(s), uncore services).

Multiprocessors and Multicores: Taking Advantage of Them?
- Needs multiple parallel application activities
- Additional considerations: availability, work mapping issues, locality issues, memory bandwidth issues

Hardware Multithreading
- Simultaneous Multithreading (SMT): multiple processing contexts managed by the same core, enabling the interleaving of multiple threads on one core
- Rationale: try to fill more computing units (e.g. int + float), hide memory/cache latency
- Taking advantage of it? Needs multiple parallel application activities; highly dependent on the characteristics of those activities (complementary vs. competitive)
- Additional considerations: availability, work mapping, locality, memory bandwidth, benefit vs. loss

Vector Processing
- Single Instruction, Multiple Data (SIMD): apply an instruction to multiple data elements simultaneously, enabling simple operations to be repeated over array elements
- Rationale: share instruction decoding between several data elements
- Taking advantage of it? Specially written kernels, via the compiler, assembly language, or intrinsics (see the sketch below)
- Additional considerations: availability, feature set/variants (MMX, 3DNow!, SSE [2...5], AVX, ...), benefit vs. loss
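As a small illustration of the intrinsics route (a sketch assuming an x86 CPU with SSE support; the function name and alignment assumptions are made up for the example), four floats are scaled per instruction:

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Scale n floats by factor, four at a time. For simplicity, n is
       assumed to be a multiple of 4 and v to be 16-byte aligned. */
    void scale_sse(float *v, unsigned n, float factor) {
        __m128 f = _mm_set1_ps(factor);            /* broadcast factor */
        for (unsigned i = 0; i < n; i += 4) {
            __m128 x = _mm_load_ps(&v[i]);         /* load 4 floats */
            _mm_store_ps(&v[i], _mm_mul_ps(x, f)); /* multiply, store */
        }
    }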

Accelerators
- Special-purpose computing devices (or general-purpose GPUs), initially discrete expansion cards. Rationale: die-area trade-off.
- Single Instruction, Multiple Threads (SIMT): a single control unit drives several computing units (scalar cores, a.k.a. streaming processors, grouped into streaming multiprocessors)
- SIMT is distinct from SIMD: it allows control flows to diverge (e.g. across an if/else), but divergence is better avoided!
[Figure: a GPU built from streaming multiprocessors, each pairing a control unit with a row of scalar cores, attached to DRAM.]

GPU Hardware Model: CPU vs. GPU
- Multiple strategies for multiple purposes
- CPU strategy: large caches, large control logic. Purpose: complex codes with branching, complex memory access patterns. (The World Rally Championship car.)
- GPU strategy: lots of computing power, simplified control. Purpose: regular data-parallel codes, simple memory access patterns. (The Formula One car.)
[Figure: a CPU die dominated by control and cache vs. a GPU die dominated by ALUs, each with its own DRAM.]

GPU Software Model (SIMT)
- Kernels are enclosed in an implicit loop over an iteration space: one kernel instance per point of the space
- Threads execute the work simultaneously
- Specific languages: NVIDIA CUDA, OpenCL

The slides illustrate the model with a CUDA vector-add kernel, together with its conceptual sequential equivalent:

    __global__ void
    vecadd(float *A, float *B, float *C) {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int main() {
        /* ... */
        /* Parallel launch: vecadd<<<1, NB>>>(A, B, C); */
        /* Conceptually equivalent to the sequential loop: */
        for (threadIdx.x = 0; threadIdx.x < NB; threadIdx.x++) {
            vecadd(A, B, C);
        }
        /* ... */
    }

Hardware abstraction: each scalar core executes instances of the kernel, and the thread executing a given instance is identified by the threadIdx variable:

    { int i = 0; C[i] = A[i]+B[i]; }
    { int i = 1; C[i] = A[i]+B[i]; }
    { int i = 2; C[i] = A[i]+B[i]; }
    { int i = 3; C[i] = A[i]+B[i]; }

Manycores
- Intel SCC: 48 cores (P54C Pentium), no cache coherence, communication library
- Intel Xeon Phi/MIC: 61 cores (P54C Pentium), 4 hardware threads per core, a 512-bit SIMD instruction set, cache coherence
- Classical programming tool-chain (compilers, libraries)... but no free lunch: kernels and applications need optimization work
- Discrete accelerator cards (for now!): transfer data to the card memory, transfer results back to main memory

Heterogeneous Parallel Platforms
- Heterogeneous association: a general-purpose processor plus a specialized accelerator
- Generalization: a combination of various units, e.g. latency-optimized cores, throughput-optimized cores, energy-optimized cores
- Distributed cores: standalone GPUs, Intel Xeon Phi (MIC), Intel Single-Chip Cloud (SCC)
- Integrated cores: Intel Haswell, AMD Fusion, NVIDIA Tegra

Programming Models for Heterogeneous Platforms?
- How to program these architectures?
- Multicore programming: pthreads, OpenMP, TBB?, Cilk?, MPI, ...
- Accelerator programming: consensus on OpenCL? (Often) a pure offloading model; also CUDA, libspe, ATI Stream
- Hybrid models? Take advantage of all resources, but complex interactions
[Figure: a node mixing CPUs and accelerator (*PU) units, each with its own memory.]

Heterogeneous Task Scheduling
- Scheduling on a platform equipped with accelerators
- Adapting to heterogeneity: decide which tasks to offload and which tasks to keep on the CPU
- Communicating with discrete accelerator board(s) is expensive: send computation requests, send the data to be processed, fetch the results back
- Hence: decide whether offloading is worthwhile

2. StarPU Programming/Execution Models

Task Parallelism Principles
- A task = an «elementary» computation + dependencies
- Input dependencies (e.g. A, B), a computation kernel (e.g. A = A+B), output dependencies (e.g. A)

Task Relationships
- Tasks and their dependencies form an abstract application structure: a directed acyclic graph (DAG)

StarPU Programming Model: Sequential Task Submission
- Express parallelism using the natural program flow
- Submit tasks in the sequential flow of the program, and let the runtime schedule the tasks asynchronously

Example: Sequential Cholesky Decomposition

    for (j = 0; j < N; j++) {
        POTRF(A[j][j]);
        for (i = j+1; i < N; i++)
            TRSM(A[i][j], A[j][j]);
        for (i = j+1; i < N; i++) {
            SYRK(A[i][i], A[i][j]);
            for (k = j+1; k < i; k++)
                GEMM(A[i][k], A[i][j], A[k][j]);
        }
    }

Example: Task-Based Cholesky Decomposition

    for (j = 0; j < N; j++) {
        POTRF(RW, A[j][j]);
        for (i = j+1; i < N; i++)
            TRSM(RW, A[i][j], R, A[j][j]);
        for (i = j+1; i < N; i++) {
            SYRK(RW, A[i][i], R, A[i][j]);
            for (k = j+1; k < i; k++)
                GEMM(RW, A[i][k], R, A[i][j], R, A[k][j]);
        }
    }
    wait();
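In actual StarPU code, each of these annotated pseudo-calls corresponds to a starpu_task_insert submission on registered data handles. A minimal sketch (assuming codelets potrf_cl, trsm_cl, etc. and a handle array Ah[][] have been set up, which the slides do not show):

    for (j = 0; j < N; j++) {
        starpu_task_insert(&potrf_cl, STARPU_RW, Ah[j][j], 0);
        for (i = j+1; i < N; i++)
            starpu_task_insert(&trsm_cl,
                               STARPU_RW, Ah[i][j],
                               STARPU_R,  Ah[j][j], 0);
        /* SYRK and GEMM submissions follow the same pattern. */
    }
    starpu_task_wait_for_all();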

Dynamic Task Graph Building

    for (j = 0; j < N; j++) {
        POTRF(RW, A[j][j]);
        for (i = j+1; i < N; i++)
            TRSM(RW, A[i][j], R, A[j][j]);
        for (i = j+1; i < N; i++) {
            SYRK(RW, A[i][i], R, A[i][j]);
            for (k = j+1; k < i; k++)
                GEMM(RW, A[i][k], R, A[i][j], R, A[k][j]);
        }
    }
    wait();

- Tasks are submitted asynchronously at run time
- Data references are annotated (R, RW)
- StarPU infers the data dependences and builds a graph of POTRF, TRSM, SYRK, and GEMM tasks
- The graph of tasks is then executed
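To make the inference concrete, here is a hand-worked trace of the submissions for N = 2 (not in the original slides); each comment names the dependence that follows from the R/RW annotations:

    POTRF(RW, A[0][0]);                /* no predecessor                      */
    TRSM (RW, A[1][0], R, A[0][0]);    /* reads A[0][0]: after POTRF(A[0][0]) */
    SYRK (RW, A[1][1], R, A[1][0]);    /* reads A[1][0]: after TRSM(A[1][0])  */
    POTRF(RW, A[1][1]);                /* writes A[1][1]: after SYRK(A[1][1]) */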

StarPU Execution Model: Task Scheduling
- Mapping the graph of tasks (DAG) onto the hardware: allocating computing resources, enforcing dependency constraints, handling data transfers
- Adaptiveness: a single DAG enables multiple schedulings, and a single DAG can be mapped onto multiple platforms
[Figure: the same DAG scheduled over time onto several CPUs and a GPU.]

A Single DAG for Multiple Schedules and Platforms
[Figure: the same DAG mapped onto a multicore CPU, onto a multi-GPU machine, and onto other CPU/GPU mixes.]

What StarPU Does for You: Heterogeneous Task Scheduling
- Heterogeneous task mapping using a selectable scheduling algorithm
[Figure: submitted tasks being dispatched among CPU cores, GPU 1, and GPU 2.]

What StarPU Does for You: Data Transfers
- Handles dependencies
- Handles (asynchronous) data transfers between main memory (RAM) and GPU memories
- Handles data replication
- Handles data consistency, using an MSI protocol (each replicate of a piece of data is tracked as Modified, Shared, or Invalid)
[Figure: POTRF/TRSM/SYRK/GEMM tasks running on the CPUs, GPU0, and GPU1, with data moving between RAM and the GPU memories.]

Showcase with the MAGMA Linear Algebra Library (UTK, Inria HiePACS, Inria RUNTIME)
- QR decomposition on 16 CPUs (AMD) + 4 GPUs (C1060)
- Measured increase when adding 12 CPUs: ~200 GFlop/s
- Expected increase when adding 12 CPUs: ~150 GFlop/s
[Figure: GFlop/s vs. matrix order, for 1 to 4 GPUs with matching CPU counts and for 4 GPUs + 16 CPUs.]

Showcase with the MAGMA Linear Algebra Library: QR Kernel Properties

    Kernel     CPU           GPU            Speed-up
    SGEQRT      9   GFlop/s   30  GFlop/s    3
    STSQRT     12   GFlop/s   37  GFlop/s    3
    SOMQRT      8.5 GFlop/s  227  GFlop/s   27
    SSSMQ      10   GFlop/s  285  GFlop/s   28

- Consequences on the task distribution: 20% of SGEQRT tasks run on GPUs, while 92% of SSSMQ tasks run on GPUs
- Taking advantage of heterogeneity: only do what you are good at, don't do what you are not good at

StarPU in a Nutshell
- Rationale: enable tasks to be submitted in the natural, sequential flow of the program, and map the computations onto heterogeneous computing units
- Shipped as an LGPL library: an application programming interface, and a target for compilers, libraries, and high-level layers
- Programming model: tasks + data relationships
- Execution model: heterogeneous task scheduling, automatic data transfers

3. StarPU Programming Example

Basic Example: Scaling a Vector

    float factor = ...;   /* scaling factor (value elided on the slide) */
    float vector[NX];
    starpu_data_handle_t vector_handle;

    /* ... fill vector ... */

    starpu_vector_data_register(&vector_handle, 0,
            (uintptr_t)vector, NX, sizeof(vector[0]));

    starpu_task_insert(
            &scal_cl,
            STARPU_RW, vector_handle,
            STARPU_VALUE, &factor, sizeof(factor),
            0);

    starpu_task_wait_for_all();
    starpu_data_unregister(vector_handle);

    /* ... display vector ... */

Declaring a Codelet to StarPU
- Define a struct starpu_codelet
- Plug the kernel function (here: scal_cpu_func)
- Declare the number of data pieces used by the kernel (here: a single vector)
- Declare how the kernel accesses each piece of data (here: the vector is scaled in place, thus R/W)

    struct starpu_codelet scal_cl = {
        .cpu_funcs = { scal_cpu_func, NULL },
        .nbuffers  = 1,
        .modes     = { STARPU_RW },
    };

Declaring and Managing Data
- Put data under StarPU control: initialize a piece of data, then register it to obtain a handle; the vector is then under StarPU control
- Use the data through the handle
- Unregister the piece of data: the handle is destroyed, and the vector is back under user control

    float vector[NX];
    /* ... fill data ... */

    starpu_data_handle_t vector_handle;
    starpu_vector_data_register(&vector_handle, 0,
            (uintptr_t)vector, NX, sizeof(vector[0]));

    /* ... use the vector through the handle ... */

    starpu_data_unregister(vector_handle);

Writing a Kernel Function
- Every kernel function has the same C prototype
- Retrieve the vector's handle, then its number of elements and base pointer
- Get the scaling factor as an inline argument
- Compute the vector scaling

    void scal_cpu_func(void *buffers[], void *cl_arg) {
        struct starpu_vector_interface *vector_handle = buffers[0];

        unsigned n = STARPU_VECTOR_GET_NX(vector_handle);
        float *vector = (float *)STARPU_VECTOR_GET_PTR(vector_handle);

        float *ptr_factor = cl_arg;

        unsigned i;
        for (i = 0; i < n; i++)
            vector[i] *= *ptr_factor;
    }

Submitting a Task
- The starpu_task_insert call inserts a task into the StarPU DAG
- Arguments: the codelet structure, the StarPU-managed data, the small-size inline data, and 0 to mark the end of the arguments

    starpu_task_insert(&scal_cl,
            STARPU_RW, vector_handle,
            STARPU_VALUE, &factor, sizeof(factor),
            0);

- Notes: the task is submitted non-blockingly; dependencies are enforced with previously submitted tasks' data, following the natural order of the program
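For instance (a sketch, not from the slides), two successive submissions on the same handle are automatically serialized, since both access vector_handle in RW mode:

    /* StarPU infers that scal #2 depends on scal #1: same data, RW access. */
    starpu_task_insert(&scal_cl, STARPU_RW, vector_handle,
                       STARPU_VALUE, &factor, sizeof(factor), 0);
    starpu_task_insert(&scal_cl, STARPU_RW, vector_handle,
                       STARPU_VALUE, &factor, sizeof(factor), 0);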

Waiting for Submitted Task Completion
- Tasks are submitted non-blockingly
- starpu_task_wait_for_all() waits for all tasks submitted so far to complete their work

    /* non-blocking task submits */
    starpu_task_insert(...);
    starpu_task_insert(...);
    starpu_task_insert(...);
    ...

    /* wait for all tasks submitted so far */
    starpu_task_wait_for_all();

Heterogeneity: Declaring Device-Specific Kernels
- Extending a codelet to handle heterogeneous platforms
- Multiple kernel implementations for a CPU: SSE-, AVX-, ... optimized kernels
- Kernel implementations for accelerator devices: OpenCL, NVIDIA CUDA kernels

    struct starpu_codelet scal_cl = {
        .cpu_funcs    = { scal_cpu_func, scal_sse_cpu_func, NULL },
        .opencl_funcs = { scal_opencl_func, NULL },
        .cuda_funcs   = { scal_cuda_func, NULL },
        .nbuffers     = 1,
        .modes        = { STARPU_RW },
    };

Writing a Kernel Function for CUDA

    static __global__ void vector_mult_cuda(unsigned n,
            float *vector, float factor) {
        unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            vector[i] *= factor;
    }

    extern "C" void scal_cuda_func(void *buffers[], void *cl_arg) {
        struct starpu_vector_interface *vector_handle = buffers[0];
        unsigned n = STARPU_VECTOR_GET_NX(vector_handle);
        float *vector = (float *)STARPU_VECTOR_GET_PTR(vector_handle);
        float *ptr_factor = (float *)cl_arg;

        unsigned threads_per_block = 64;
        unsigned nblocks = (n + threads_per_block - 1) / threads_per_block;

        vector_mult_cuda<<<nblocks, threads_per_block, 0,
                starpu_cuda_get_local_stream()>>>(n, vector, *ptr_factor);
    }

4. StarPU Internals

StarPU Internal Structure
- HPC applications sit on top of a high-level data management library and an execution model
- A scheduling engine dispatches work to specific drivers, mastering CPUs, GPUs, SPUs, ... (*PU)
[Figure: the StarPU software stack, from HPC applications down to the processing units.]

175 StarPU Internal Functioning Application Memory Management (DSM) Scheduling engine A B GPU driver CPU driver #k RAM GPU... CPU#k A A B Notify termination Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 56
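
A minimal sketch of the submission step in the diagram, assuming a hypothetical CPU kernel accum_cpu_func and already-registered handles for A and B:

#include <starpu.h>

/* Hypothetical kernel performing A += B on the two buffers. */
extern void accum_cpu_func(void *buffers[], void *cl_arg);

static struct starpu_codelet accum_cl = {
    .cpu_funcs = { accum_cpu_func },
    .nbuffers  = 2,
    .modes     = { STARPU_RW, STARPU_R },  /* A is read-write, B read-only */
};

void submit_accum(starpu_data_handle_t handle_A, starpu_data_handle_t handle_B)
{
    /* Returns as soon as the task is queued; scheduling, data fetches,
     * execution and the termination notification happen asynchronously. */
    starpu_task_insert(&accum_cl,
                       STARPU_RW, handle_A,
                       STARPU_R,  handle_B,
                       0);
}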

176-177 StarPU Scheduling Policies
No one-size-fits-all policy: applications may have very different characteristics and very different scheduling needs.
Selectable scheduling policy, with a predefined set of popular policies: Eager, Work Stealing, Priority, and performance-model-based policies. The policy set is extensible.
Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 57

178-190 The Eager Scheduler (animation)
First come, first served policy: submitted tasks are pushed to a single central queue, and each worker (CPU cores, GPU 1, GPU 2) pops the next available task as soon as it becomes idle.
Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 58

191-198 The Work Stealing Scheduler (animation)
Load balancing policy: each worker (CPU cores, GPU 1, GPU 2) keeps its own task queue and, when idle, steals work from the queues of busy workers.
Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 59

199-203 Selecting a Scheduling Policy
Use the STARPU_SCHED environment variable; no need to recompile the application.

Example 1: selecting the prio scheduler
$ export STARPU_SCHED=prio
$ my_program
...

Example 2: selecting the ws scheduler
$ export STARPU_SCHED=ws
$ my_program
...

Example 3: resetting to the default scheduler, eager
$ unset STARPU_SCHED
$ my_program
...

Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 60
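
The policy can also be chosen programmatically at initialization time. A minimal sketch using the starpu_conf structure (note that the STARPU_SCHED environment variable, if set, may still take precedence over this field):

#include <starpu.h>

/* Sketch: select the scheduling policy from the code rather than the
 * environment; "dmda" is one of the predefined policy names. */
int init_with_dmda(void)
{
    struct starpu_conf conf;
    starpu_conf_init(&conf);
    conf.sched_policy_name = "dmda";
    return starpu_init(&conf);
}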

204-210 Going Beyond
Scheduling is a decision process: providing more input to the scheduler can lead to better scheduling decisions.
What kind of information?
Relative importance of tasks: priorities
Cost of tasks: codelet performance models
Cost of transferring data: bus calibration
Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 61

211-213 The Prio Scheduler
Describe the relative importance of tasks: assign priorities to tasks (values: ...).
Tell which tasks matter: tasks that unlock key data pieces, tasks that generate a lot of parallelism. A priority-tagging sketch follows.
Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 62
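
A minimal sketch of how a priority can be attached at submission time, assuming an existing codelet cl and registered data handle:

#include <starpu.h>

/* Sketch: give a task the highest priority so the prio scheduler
 * dequeues it before lower-priority work. */
void submit_urgent(struct starpu_codelet *cl, starpu_data_handle_t handle)
{
    starpu_task_insert(cl,
                       STARPU_PRIORITY, STARPU_MAX_PRIO,
                       STARPU_RW, handle,
                       0);
}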

214-225 The Prio Scheduler (animation): tasks are submitted with a priority (Prio 3, Prio 2, Prio 1) and queued accordingly; workers (CPU cores, GPU 1, GPU 2) always pick the highest-priority task available.
Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 63

226-228 The Deque Model (dm) Scheduler
Inspired by HEFT (Heterogeneous Earliest Finish Time), a popular scheduling algorithm: try to get the best from accelerators and CPUs.
Using codelet performance models:
Kernel calibration on each available computing device
Raw history model of kernels' past execution times
Refined models using regression on the kernels' execution time history
Model parameter: data size by default; user-defined for more complex cases, e.g. sparse data structures (number of non-zeros, NNZ) or iterative kernels.
Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 64
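
A minimal sketch of how a codelet is given a history-based performance model so that dm can predict execution times; the symbol names the stored calibration data, and scal_cpu_func is assumed from the earlier example:

#include <starpu.h>

extern void scal_cpu_func(void *buffers[], void *cl_arg);

/* History-based model: StarPU records past execution times per device
 * and input size under the "scal" symbol; regression-based model types
 * (e.g. STARPU_REGRESSION_BASED) refine this with curve fitting. */
static struct starpu_perfmodel scal_model = {
    .type   = STARPU_HISTORY_BASED,
    .symbol = "scal",
};

static struct starpu_codelet scal_cl = {
    .cpu_funcs = { scal_cpu_func },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
    .model     = &scal_model,
};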

229-238 The Deque Model (dm) Scheduler (animation): using the codelet performance models, the scheduler estimates, for each worker (CPU cores, GPU 1, GPU 2), when the newly submitted task would finish there, and dispatches it to the worker with the earliest expected finish time.
Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 65

239-241 Data Transfer Modelling
Discrete accelerators: CPU ↔ GPU transfers; data transfer cost vs. kernel offload benefit.
Transfer cost modelling: bus calibration, which can differ even for identical devices depending on the platform's topology.
Data-transfer-aware scheduling: Deque Model Data Aware (dmda) policy variants, with a tunable data transfer cost bias (locality vs. load balancing).
Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 66
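
In practice, a dmda run is typically calibrated first. A sketch using the calibration environment variables (STARPU_CALIBRATE forces performance-model recalibration, STARPU_BUS_CALIBRATE forces bus recalibration):

$ export STARPU_SCHED=dmda
$ STARPU_CALIBRATE=1 STARPU_BUS_CALIBRATE=1 my_program   # calibration runs
$ my_program                                             # reuses the stored models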

242-245 Prefetching
Use the DAG to anticipate required data and overlap transfers with computation.
Task states:
Submitted: task inserted by the application
Ready: task's dependencies resolved
Scheduled: task queued on a computing unit
Executing: task running on a computing unit
Anticipate the Scheduled → Executing transition: a prefetch is triggered as soon as possible after the Scheduled state. A prefetch may also be triggered by the application (see the sketch below).
Olivier Aumage STORM Team The StarPU Runtime 4. StarPU Internals 67
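
A minimal sketch of an application-triggered prefetch, assuming a registered handle and a target memory node number:

#include <starpu.h>

/* Sketch: move the handle's content to the given memory node ahead of
 * the tasks that will need it there (the last argument requests an
 * asynchronous transfer). */
void warm_up(starpu_data_handle_t handle, unsigned memory_node)
{
    starpu_data_prefetch_on_node(handle, memory_node, 1);
}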

246 5 Extended Features Olivier Aumage STORM Team The StarPU Runtime 68

247 Extended Features
Platform support: distributed computing (StarPU-MPI), out-of-core support, Intel MIC / Xeon Phi support, SimGrid support
Programming support: OpenMP 4.0 compiler (Klang-OMP), OpenCL backend
Scheduling support: composition on multicores and manycores, scheduling contexts
Olivier Aumage STORM Team The StarPU Runtime 5. Extended Features 69

248 Distributed Computing: StarPU-MPI
Extending StarPU's paradigm to clusters: no global scheduler.
Task-to-node mapping: provided by the application through the initial data distribution; can be altered dynamically.
(diagram: nodes 0-3, with Isend/Irecv transfers between Node 1 and Node 2)
Olivier Aumage STORM Team The StarPU Runtime 5. Extended Features 70

249 Communication Requests
Nodes infer the required transfers from task dependencies; the matching MPI calls (Isend/Irecv) are generated automatically, and tasks wait for the corresponding MPI requests.
(diagram: a POTRF/TRSM/SYRK/GEMM task graph split between node0 and node1, with Isend/Irecv pairs on the cut edges)
Olivier Aumage STORM Team The StarPU Runtime 5. Extended Features 71
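
A minimal sketch of the distributed pattern, assuming a codelet cl and a handle whose owner rank and tag are known; every node runs the same code:

#include <starpu_mpi.h>

/* Sketch: every node declares the same data-to-node mapping and submits
 * the same task sequence; StarPU-MPI executes each task on the owner of
 * its written data and posts the matching Isend/Irecv pairs automatically. */
void distributed_step(struct starpu_codelet *cl,
                      starpu_data_handle_t handle, int tag, int owner_rank)
{
    starpu_mpi_data_register(handle, tag, owner_rank);
    starpu_mpi_task_insert(MPI_COMM_WORLD, cl,
                           STARPU_RW, handle,
                           0);
}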

250-258 Out-of-Core (animation: data moving between MEM, CPUs, GPU0/GPU1 and Disk)
Enable StarPU to evict temporarily unused data to disk.
Integration with StarPU's general memory management layer: StarPU data handles, task dependencies; evicted data is reloaded automatically.
Multiple disk drivers supported: legacy stdio/unistd methods, Google's LevelDB (key/value database library).
Olivier Aumage STORM Team The StarPU Runtime 5. Extended Features 72
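
A minimal sketch of registering a disk node with the unistd driver; the path and size are illustrative assumptions:

#include <starpu.h>

/* Sketch: add a disk memory node backed by the unistd driver; StarPU can
 * then evict unused handles there and reload them on demand (often combined
 * with STARPU_LIMIT_CPU_MEM to cap main-memory usage). */
int add_disk_node(void)
{
    /* arguments: driver ops, driver parameter (here a directory path),
     * and the maximum space to use, in bytes */
    return starpu_disk_register(&starpu_disk_unistd_ops,
                                (void *) "/tmp/starpu-swap",
                                (starpu_ssize_t) 1024*1024*1024);
}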

259 Support for Manycore Accelerators
Manycores: Intel Xeon Phi (MIC); also: Intel SCC (Single-Chip Cloud).
Technical details: shared support for common characteristics; several StarPU instances, one CPU instance (the source) and one instance per manycore accelerator (the sink); scheduling performed by the main CPU StarPU instance; separate compilation for CPU code and MIC code.
Straightforward port.
Olivier Aumage STORM Team The StarPU Runtime 5. Extended Features 73

260-263 Klang-omp OpenMP C/C++ Compiler
High-level programming: translates OpenMP directives into runtime system API calls, targeting the StarPU runtime system or the XKaapi runtime system (INRIA team MOAIS).
OpenMP 3.1: virtually full support. OpenMP 4.0: dependent tasks; heterogeneous targets (ongoing work).
LLVM-based source-to-source compiler, building on the open-source Intel compiler clang-omp. Available on the K Star project website.
Olivier Aumage STORM Team The StarPU Runtime 5. Extended Features 74

264 Klang-omp Example: Tasks

int item[N];

void g(int);

void f()
{
    #pragma omp parallel
    {
        #pragma omp single
        {
            int i;
            for (i = 0; i < N; i++)
                #pragma omp task untied
                g(item[i]);
        }
    }
}

Olivier Aumage STORM Team The StarPU Runtime 5. Extended Features 75

265 Klang-omp Example: Dependencies

void f()
{
    int a;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(a) depend(out: a)
        foo(&a);

        #pragma omp task shared(a) depend(in: a)
        bar(&a);
    }
}

Olivier Aumage STORM Team The StarPU Runtime 5. Extended Features 76

266 Klang-omp Example: Targets

#pragma omp declare target
extern void g(int *A, int *B, int *C);
#pragma omp end declare target

int A[N], B[N], C[N];

void f()
{
    #pragma omp parallel
    #pragma omp master
    #pragma omp target map(to: A, B) map(from: C)
    g(A, B, C);
}

Olivier Aumage STORM Team The StarPU Runtime 5. Extended Features 77

267-270 SOCL Layer: StarPU as an OpenCL Backend
High-level programming. SOCL rationale: run generic OpenCL codes on top of StarPU (generic OpenCL applications → StarPU OpenCL layer → StarPU drivers → CPU cores, GPUs and other OpenCL devices).
Technical details: StarPU acts as an OpenCL backend through an ICD (Installable Client Driver) that redirects OpenCL calls to StarPU routines; SOCL can itself run OpenCL kernels on OpenCL devices.
Olivier Aumage STORM Team The StarPU Runtime 5. Extended Features 78

271 Composition: Scheduling Contexts
Rationale: sharing computing resources among multiple DAGs simultaneously; composing codes and kernels.
Scheduling contexts: map DAGs onto subsets of the computing units; isolate competing kernels or library calls (OpenMP kernel, Intel MKL, etc.); select the scheduling policy per context (Context 1, Context 2). A minimal sketch follows.
Olivier Aumage STORM Team The StarPU Runtime 5. Extended Features 79
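
A minimal sketch of creating a context over a subset of workers with its own policy; the workerids array is assumed to hold valid worker identifiers:

#include <starpu.h>

/* Sketch: carve a context out of the machine with its own policy name. */
unsigned make_context(int *workerids, int nworkers, const char *name)
{
    return starpu_sched_ctx_create(workerids, nworkers, name,
                                   STARPU_SCHED_CTX_POLICY_NAME, "dmda",
                                   0);
}

/* A task can then be pinned to a context before submission: */
void submit_in_context(struct starpu_task *task, unsigned ctx)
{
    task->sched_ctx = ctx;
    starpu_task_submit(task);
}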

272 Contexts: Dynamic Resource Management (diagram: CPUs and GPUs being reassigned between Context 1 and Context 2 at runtime)
Olivier Aumage STORM Team The StarPU Runtime 5. Extended Features 80

273 Debugging/Analysis Support
Online tools: visual debugging with TEMANEJO (High Performance Computing Center Stuttgart, HLRS), a visual task-debugging GUI.
Offline tools: statistics, performance models, theoretical lower bound; trace-based analysis: Gantt charts, Graphviz DAGs, R plots.
Olivier Aumage STORM Team The StarPU Runtime 5. Extended Features 81
