Introduction to Runtime Systems


Introduction to Runtime Systems
Towards Portability of Performance
STORM (STatic Optimizations, Runtime Methods)
Olivier Aumage, Team STORM, Inria / LaBRI, in cooperation with La Maison de la Simulation

Contents
1. Introduction
2. Computing Hardware
3. Parallel Programming Models
4. Computing Runtime Systems

1. Introduction

Hardware Evolution
More capabilities, more complexity
- Graphics: higher resolutions, 2D acceleration, 3D rendering
- Networking: processing offload, zero-copy transfers, hardware multiplexing
- I/O: RAID, SSDs vs. disks, network-attached disks, parallel file systems
- Computing: multiprocessors, multicores, vector processing extensions, accelerators

Dilemma for the Application Programmer
Stay conservative?
- Only use standards and long-established features: sequential programming, common Unix system calls, TCP sockets
- Risk: under-used hardware, low performance
Use tempting, bleeding-edge features?
- Gain: efficiency, convenience
- Risk: portability? adaptiveness? cost? long-term viability? vendor-tied code?
Use runtime systems!

The Role(s) of Runtime Systems
- Portability: abstraction; drivers, plugins
- Control: resource mapping, scheduling
- Adaptiveness: load balancing; monitoring, sampling, calibrating
- Optimization: request aggregation, resource locality, computation offload, computation/transfer overlap

Examples of Runtime Systems
- Networking: MPI (Message Passing Interface), Global Arrays, CCI (Common Communication Interface), Distributed Shared Memory systems
- Graphics: DirectX, Direct3D (Microsoft Windows); OpenGL
- I/O: MPI-IO; database engines (Google LevelDB)
- Computing runtime systems?...

2. Computing Hardware

Evolution of Computing Hardware
Rupture: the Frequency Wall
- Processing units cannot run any faster; looking for other sources of performance
Hardware parallelism
- Multiply existing processing power: have several processing units work together
- Not a new idea, but now becoming the key performance factor

Processor Parallelism
Various forms of hardware parallelism:
- Multiprocessors
- Multicores
- Hardware multithreading (SMT)
- Vector processing (SIMD)
Multiple forms may be combined.

Multiprocessors and Multicores
Multiprocessors: full processor replicates
- Rationale: share node contents, memory and devices
- Memory sharing may involve non-uniformity
- See the upcoming hwloc and TreeMatch talks!
Multicores: processor circuit replicates (cores) printed on the same die
- Rationale: use the die area made available by the shrinking process for more processing power
- Share memory and devices; may share some additional die circuitry (cache(s), uncore services)

Taking advantage of multiprocessors and multicores?
- Needs multiple parallel application activities
- Additional considerations: availability; work mapping, locality, and memory bandwidth issues

Hardware Multithreading
Simultaneous Multithreading (SMT)
- Multiple processing contexts managed by the same core, enabling interleaving of multiple threads on that core
- Rationale: try to fill more computing units (e.g. int + float), hide memory/cache latency
Taking advantage of it?
- Needs multiple parallel application activities
- Highly dependent on the characteristics of those activities: complementary vs. competitive
- Additional considerations: availability; work mapping, locality, and memory bandwidth issues; benefit vs. loss

Vector Processing
Single Instruction, Multiple Data (SIMD)
- Apply an instruction to multiple data elements simultaneously, e.g. repeating simple operations on array elements
- Rationale: share instruction decoding between several data elements
Taking advantage of it?
- Specially written kernels: compiler auto-vectorization, assembly language, or intrinsics
Additional considerations
- Availability and feature set/variants: MMX, 3DNow!, SSE [2...5], AVX...
- Benefit vs. loss

Accelerators
Special-purpose computing devices (or general-purpose GPUs)
- (Initially) a discrete expansion card
- Rationale: die area trade-off
Single Instruction, Multiple Threads (SIMT)
- A single control unit drives several scalar cores (streaming processors grouped in a streaming multiprocessor, backed by the GPU DRAM)
- SIMT is distinct from SIMD: control flows are allowed to diverge (e.g. on an if/else)... but better avoid it!

GPU Hardware Model: CPU vs. GPU
Multiple strategies for multiple purposes
- CPU strategy: large caches, large control logic. Purpose: complex, branching codes; complex memory access patterns. (A World Rally Championship car.)
- GPU strategy: lots of computing power, simplified control. Purpose: regular data-parallel codes; simple memory access patterns. (A Formula One car.)
[Figure: CPU die dominated by control and cache vs. GPU die dominated by ALUs, each attached to its own DRAM.]

GPU Software Model (SIMT)
- Kernels enclosed in an implicit loop over an iteration space: one kernel instance for each point of the space
- Threads execute the work simultaneously
- Specific languages: NVIDIA CUDA, OpenCL

    __global__ void
    vecadd(float *A, float *B, float *C) {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int
    main() {
        /* ... */
        /* vecadd<<<1, NB>>>(A, B, C); behaves as the implicit loop: */
        for (threadIdx.x = 0; threadIdx.x < NB; threadIdx.x++) {
            vecadd(A, B, C);
        }
        /* ... */
    }

Hardware abstraction
- Scalar cores execute instances of the kernel
- The thread executing a given instance is identified by the threadIdx variable:

    // i = threadIdx.x
    { int i = 0;          { int i = 1;          { int i = 2;          { int i = 3;
      C[i] = A[i]+B[i]; }   C[i] = A[i]+B[i]; }   C[i] = A[i]+B[i]; }   C[i] = A[i]+B[i]; }

Manycores
Intel SCC
- 48 cores (P54C Pentium)
- No cache coherence; communication library
Intel Xeon Phi / MIC
- 61 cores (P54C-based), 4 hardware threads per core
- 512-bit SIMD instruction set
- Cache coherence
Classical programming tool-chain (compilers, libraries)... but no free lunch
- Kernels and applications need optimization work
Discrete accelerator cards (for now!)
- Transfer data to card memory, transfer results back to main memory

3. Parallel Programming Models

Parallel Programming Models
Languages
- Directive-based languages
- Specialized languages
- PGAS languages...
Libraries
- Linear algebra
- FFT...

Directive-Based Languages - Cilk
Programming environment
- A language and compiler: keyword-based extension of C
- An execution model and a run-time system
Recursive parallelism: divide-and-conquer model
History
- Initially developed at the MIT Supertech Research Group (Charles E. Leiserson's team), mid-90s
- Now developed by Intel; available in ICC and GNU GCC, experimental version in LLVM/Clang

    cilk int fibo(int n) {
        int r;
        if (n < 2)
            r = n;
        else {
            int x, y;
            spawn x = fibo(n - 1);
            spawn y = fibo(n - 2);
            sync;
            r = x + y;
        }
        return r;
    }

Directive-Based Languages - OpenMP
Iterative parallelism: a parallel section spawns a team of threads

    int i;

    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < N; i++) {
            C[i] = A[i] + B[i];
        }
    }

Task parallelism, recursive parallelism
- Tasks: OpenMP 3.0
- Task dependencies, accelerators: OpenMP 4.0

    list *ptr = list_head;

    #pragma omp parallel
    {
        #pragma omp single
        while (ptr != NULL) {
            void *data = ptr->data;

            #pragma omp task firstprivate(data)
            {
                process(data);
            }

            ptr = ptr->next;
        }

        #pragma omp taskwait
    }

PGAS Languages - UPC
Partitioned Global Address Space: Unified Parallel C
- Global shared data, data distribution
- Parallel loops, threads
- Task extensions: UPC Task Library

    #include <upc_relaxed.h>

    shared [THREADS] int a[THREADS][THREADS];
    shared int b[THREADS];
    shared int c[THREADS];
    int i, j;

    upc_forall (i = 0; i < THREADS; i++; i) {
        c[i] = 0;
        for (j = 0; j < THREADS; j++) {
            c[i] += a[i][j] * b[j];
        }
    }

Libraries
Specialized libraries: black-box parallelism
- Linear algebra: BLAS, LAPACK; Intel MKL, MAGMA, PLASMA
- Signal processing: FFTW, Spiral...

Common Denominator
Many similar fundamental services
- A lower-level layer, an abstraction/optimization layer: the computing runtime system
- Mapping work on computing resources, resolving trade-offs, optimizing, scheduling

4. Computing Runtime Systems

Computing Runtime Systems
Two classes:
- Thread scheduling
- Task scheduling

Thread Scheduling
Thread: unbounded parallel activity, one state/context per thread
Variants
- Cooperative multithreading
- Preemptive multithreading
Examples: nowadays, libpthread
Discussion
- Flexibility
- Resource consumption? Adaptiveness? Synchronization?

Task Scheduling
Task: an elementary computation, a piece of potential parallel work
- No dedicated state; executed by an internal set of worker threads
Variants
- Recursive tasks vs. non-blocking tasks
- Dependency management
Examples: StarPU; Cilk's runtime; Intel Threading Building Blocks (TBB); StarSs / OmpSs; PaRSEC...
Discussion
- Abstraction, adaptiveness
- Transparent synchronization using dependencies

Heterogeneous Task Scheduling
Scheduling on platforms equipped with accelerators
Adapting to heterogeneity
- Decide which tasks to offload and which tasks to keep on the CPU
Communicating with discrete accelerator board(s) is expensive
- Send computation requests, send data to be processed, fetch results back
- Decide whether offloading is worth the cost
See the StarPU talk

Computing Runtimes Ecosystem: Scheduling and Memory Management
Data transfers: CPU <-> discrete accelerator
- Minimize transfers
- Overlap transfers and requests with computation
- Cooperate with a Distributed Shared Memory system

Computing Runtimes Ecosystem: Scheduling and Networking
Distributed computing: interoperability, minimization, overlap
Cooperation with a network library (MPI, Global Arrays, etc.)
- Anticipate communication needs
- Merge multiple requests
- Throttle/alter scheduling with network events

Computing Runtimes Ecosystem: Scheduling and I/O
Out-of-core: very large computations, temporarily storing large data structures on disk
- Interoperability, minimization, overlap
Cooperation with an I/O library
- When to store some data on disk? When to fetch it back? Heuristics

Computing Runtimes Ecosystem: Scheduling and Scheduling Theory
Algorithmics
- Designing scheduling algorithms
- Testing scheduling algorithms in real life
Computing runtimes as an interface framework
- Plug in new algorithms, keep the same interface, transparent for the application

Conclusion
Runtimes as interface frameworks
- Portability, control, adaptiveness, optimization
- Portability of performance

Program of the Training Session
Thursday, June 04
- 09:00 (09:30) - 10:00: Introduction to Runtime Systems (Olivier Aumage)
- ... coffee break ...
- 10:15 - 12:00: The StarPU computing runtime, Part I (Olivier Aumage, Nathalie Furmento, Samuel Thibault)
- ... lunch break ...
- 14:00 - 16:00: The Eztrace framework for performance debugging, Part I (Matias Hastaran, François Rué)
Friday, June 05
- 09:00 - 11:00: The hardware locality library, hwloc (Brice Goglin)
- ... coffee break ...
- 11:15 - 12:45: TreeMatch, a process placement framework for multicore clusters (Emmanuel Jeannot)
- ... lunch break ...
- 14:00 - 16:00: The StarPU computing runtime, Part II


More information

StarPU: a runtime system for multigpu multicore machines

StarPU: a runtime system for multigpu multicore machines StarPU: a runtime system for multigpu multicore machines Raymond Namyst RUNTIME group, INRIA Bordeaux Journées du Groupe Calcul Lyon, November 2010 The RUNTIME Team High Performance Runtime Systems for

More information

ECE 574 Cluster Computing Lecture 15

ECE 574 Cluster Computing Lecture 15 ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements

More information

Towards a codelet-based runtime for exascale computing. Chris Lauderdale ET International, Inc.

Towards a codelet-based runtime for exascale computing. Chris Lauderdale ET International, Inc. Towards a codelet-based runtime for exascale computing Chris Lauderdale ET International, Inc. What will be covered Slide 2 of 24 Problems & motivation Codelet runtime overview Codelets & complexes Dealing

More information

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance

More information

SIMD Exploitation in (JIT) Compilers

SIMD Exploitation in (JIT) Compilers SIMD Exploitation in (JIT) Compilers Hiroshi Inoue, IBM Research - Tokyo 1 What s SIMD? Single Instruction Multiple Data Same operations applied for multiple elements in a vector register input 1 A0 input

More information

Why you should care about hardware locality and how.

Why you should care about hardware locality and how. Why you should care about hardware locality and how. Brice Goglin TADaaM team Inria Bordeaux Sud-Ouest Agenda Quick example as an introduction Bind your processes What's the actual problem? Convenient

More information

The Heterogeneous Programming Jungle. Service d Expérimentation et de développement Centre Inria Bordeaux Sud-Ouest

The Heterogeneous Programming Jungle. Service d Expérimentation et de développement Centre Inria Bordeaux Sud-Ouest The Heterogeneous Programming Jungle Service d Expérimentation et de développement Centre Inria Bordeaux Sud-Ouest June 19, 2012 Outline 1. Introduction 2. Heterogeneous System Zoo 3. Similarities 4. Programming

More information

GPU GPU CPU. Raymond Namyst 3 Samuel Thibault 3 Olivier Aumage 3

GPU GPU CPU. Raymond Namyst 3 Samuel Thibault 3 Olivier Aumage 3 /CPU,a),2,2 2,2 Raymond Namyst 3 Samuel Thibault 3 Olivier Aumage 3 XMP XMP-dev CPU XMP-dev/StarPU XMP-dev XMP CPU StarPU CPU /CPU XMP-dev/StarPU N /CPU CPU. Graphics Processing Unit GP General-Purpose

More information

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming Lecturer: Alan Christopher Overview GP-GPU: What and why OpenCL, CUDA, and programming GPUs GPU Performance

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

GPU programming. Dr. Bernhard Kainz

GPU programming. Dr. Bernhard Kainz GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling

More information

COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES

COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES P(ND) 2-2 2014 Guillaume Colin de Verdière OCTOBER 14TH, 2014 P(ND)^2-2 PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France October 14th, 2014 Abstract:

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

CS420: Operating Systems

CS420: Operating Systems Threads James Moscola Department of Physical Sciences York College of Pennsylvania Based on Operating System Concepts, 9th Edition by Silberschatz, Galvin, Gagne Threads A thread is a basic unit of processing

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

The Era of Heterogeneous Computing

The Era of Heterogeneous Computing The Era of Heterogeneous Computing EU-US Summer School on High Performance Computing New York, NY, USA June 28, 2013 Lars Koesterke: Research Staff @ TACC Nomenclature Architecture Model -------------------------------------------------------

More information

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty

More information

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Finite Element Integration and Assembly on Modern Multi and Many-core Processors Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,

More information

HPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber,

HPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber, HPC trends (Myths about) accelerator cards & more June 24, 2015 - Martin Schreiber, M.Schreiber@exeter.ac.uk Outline HPC & current architectures Performance: Programming models: OpenCL & OpenMP Some applications:

More information

Parallel Programming Libraries and implementations

Parallel Programming Libraries and implementations Parallel Programming Libraries and implementations Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.

More information

Preparing seismic codes for GPUs and other

Preparing seismic codes for GPUs and other Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok

Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Texas Learning and Computation Center Department of Computer Science University of Houston Outline Motivation

More information

Introduction II. Overview

Introduction II. Overview Introduction II Overview Today we will introduce multicore hardware (we will introduce many-core hardware prior to learning OpenCL) We will also consider the relationship between computer hardware and

More information

! Readings! ! Room-level, on-chip! vs.!

! Readings! ! Room-level, on-chip! vs.! 1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads

More information

Shared-memory Parallel Programming with Cilk Plus

Shared-memory Parallel Programming with Cilk Plus Shared-memory Parallel Programming with Cilk Plus John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 4 30 August 2018 Outline for Today Threaded programming

More information

OpenMP 4.0. Mark Bull, EPCC

OpenMP 4.0. Mark Bull, EPCC OpenMP 4.0 Mark Bull, EPCC OpenMP 4.0 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all devices!

More information

Multi-Processors and GPU

Multi-Processors and GPU Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock

More information

OpenMP 4.0/4.5. Mark Bull, EPCC

OpenMP 4.0/4.5. Mark Bull, EPCC OpenMP 4.0/4.5 Mark Bull, EPCC OpenMP 4.0/4.5 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing Programming Models for Multi- Threading Brian Marshall, Advanced Research Computing Why Do Parallel Computing? Limits of single CPU computing performance available memory I/O rates Parallel computing allows

More information

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017 Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

Master Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.

Master Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms. Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Multithreading Contents Main features of explicit multithreading

More information

Some features of modern CPUs. and how they help us

Some features of modern CPUs. and how they help us Some features of modern CPUs and how they help us RAM MUL core Wide operands RAM MUL core CP1: hardware can multiply 64-bit floating-point numbers Pipelining: can start the next independent operation before

More information

General introduction: GPUs and the realm of parallel architectures

General introduction: GPUs and the realm of parallel architectures General introduction: GPUs and the realm of parallel architectures GPU Computing Training August 17-19 th 2015 Jan Lemeire (jan.lemeire@vub.ac.be) Graduated as Engineer in 1994 at VUB Worked for 4 years

More information

AutoTune Workshop. Michael Gerndt Technische Universität München

AutoTune Workshop. Michael Gerndt Technische Universität München AutoTune Workshop Michael Gerndt Technische Universität München AutoTune Project Automatic Online Tuning of HPC Applications High PERFORMANCE Computing HPC application developers Compute centers: Energy

More information

Parallel Systems. Project topics

Parallel Systems. Project topics Parallel Systems Project topics 2016-2017 1. Scheduling Scheduling is a common problem which however is NP-complete, so that we are never sure about the optimality of the solution. Parallelisation is a

More information

A low memory footprint OpenCL simulation of short-range particle interactions

A low memory footprint OpenCL simulation of short-range particle interactions A low memory footprint OpenCL simulation of short-range particle interactions Raymond Namyst STORM INRIA Group With Samuel Pitoiset, Inria and Emmanuel Cieren, Laurent Colombet, Laurent Soulard, CEA/DAM/DPTA

More information

Introduction to CUDA Programming

Introduction to CUDA Programming Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

Architecture, Programming and Performance of MIC Phi Coprocessor

Architecture, Programming and Performance of MIC Phi Coprocessor Architecture, Programming and Performance of MIC Phi Coprocessor JanuszKowalik, Piotr Arłukowicz Professor (ret), The Boeing Company, Washington, USA Assistant professor, Faculty of Mathematics, Physics

More information

Parallel Accelerators

Parallel Accelerators Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon

More information

Tutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers

Tutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Tutorial Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld dan/lars/bbarth/milfeld@tacc.utexas.edu XSEDE 12 July 16, 2012

More information

CS516 Programming Languages and Compilers II

CS516 Programming Languages and Compilers II CS516 Programming Languages and Compilers II Zheng Zhang Spring 2015 Jan 22 Overview and GPU Programming I Rutgers University CS516 Course Information Staff Instructor: zheng zhang (eddy.zhengzhang@cs.rutgers.edu)

More information

CPU-GPU Heterogeneous Computing

CPU-GPU Heterogeneous Computing CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems

More information

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include 3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI

More information

Programmer's View of Execution Teminology Summary

Programmer's View of Execution Teminology Summary CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 28: GP-GPU Programming GPUs Hardware specialized for graphics calculations Originally developed to facilitate the use of CAD programs

More information

The Stampede is Coming: A New Petascale Resource for the Open Science Community

The Stampede is Coming: A New Petascale Resource for the Open Science Community The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation

More information

Intel Software Development Products for High Performance Computing and Parallel Programming

Intel Software Development Products for High Performance Computing and Parallel Programming Intel Software Development Products for High Performance Computing and Parallel Programming Multicore development tools with extensions to many-core Notices INFORMATION IN THIS DOCUMENT IS PROVIDED IN

More information

Warps and Reduction Algorithms

Warps and Reduction Algorithms Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum

More information

Lecture 5. Performance programming for stencil methods Vectorization Computing with GPUs

Lecture 5. Performance programming for stencil methods Vectorization Computing with GPUs Lecture 5 Performance programming for stencil methods Vectorization Computing with GPUs Announcements Forge accounts: set up ssh public key, tcsh Turnin was enabled for Programming Lab #1: due at 9pm today,

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

! XKaapi : a runtime for highly parallel (OpenMP) application

! XKaapi : a runtime for highly parallel (OpenMP) application XKaapi : a runtime for highly parallel (OpenMP) application Thierry Gautier thierry.gautier@inrialpes.fr MOAIS, INRIA, Grenoble C2S@Exa 10-11/07/2014 Agenda - context - objective - task model and execution

More information

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA. Matrix Algebra on GPU and Multicore Architectures MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS

GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS CIS 601 - Graduate Seminar Presentation 1 GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS PRESENTED BY HARINATH AMASA CSU ID: 2697292 What we will talk about.. Current problems GPU What are GPU Databases GPU

More information

Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC

Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC DLR.de Chart 1 Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC Melven Röhrig-Zöllner DLR, Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU)

More information

Vector Processors and Graphics Processing Units (GPUs)

Vector Processors and Graphics Processing Units (GPUs) Vector Processors and Graphics Processing Units (GPUs) Many slides from: Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley TA Evaluations Please fill out your

More information

COSC 6339 Accelerators in Big Data

COSC 6339 Accelerators in Big Data COSC 6339 Accelerators in Big Data Edgar Gabriel Fall 2018 Motivation Programming models such as MapReduce and Spark provide a high-level view of parallelism not easy for all problems, e.g. recursive algorithms,

More information

Parallel Programming on Ranger and Stampede

Parallel Programming on Ranger and Stampede Parallel Programming on Ranger and Stampede Steve Lantz Senior Research Associate Cornell CAC Parallel Computing at TACC: Ranger to Stampede Transition December 11, 2012 What is Stampede? NSF-funded XSEDE

More information

CUDA. Matthew Joyner, Jeremy Williams

CUDA. Matthew Joyner, Jeremy Williams CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel

More information

Trends and Challenges in Multicore Programming

Trends and Challenges in Multicore Programming Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores

More information

Getting Started with Intel SDK for OpenCL Applications

Getting Started with Intel SDK for OpenCL Applications Getting Started with Intel SDK for OpenCL Applications Webinar #1 in the Three-part OpenCL Webinar Series July 11, 2012 Register Now for All Webinars in the Series Welcome to Getting Started with Intel

More information

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Francisco D. Igual Enrique S. Quintana-Ortí Gregorio Quintana-Ortí Universidad Jaime I de Castellón (Spain) Robert A. van de

More information

Lecture 11: GPU programming

Lecture 11: GPU programming Lecture 11: GPU programming David Bindel 4 Oct 2011 Logistics Matrix multiply results are ready Summary on assignments page My version (and writeup) on CMS HW 2 due Thursday Still working on project 2!

More information

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion

More information

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008 Michael Doggett Graphics Architecture Group April 2, 2008 Graphics Processing Unit Architecture CPUs vsgpus AMD s ATI RADEON 2900 Programming Brook+, CAL, ShaderAnalyzer Architecture Challenges Accelerated

More information

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing NTNU, IMF February 16. 2018 1 Recap: Distributed memory programming model Parallelism with MPI. An MPI execution is started

More information

Scientific Computing on GPUs: GPU Architecture Overview

Scientific Computing on GPUs: GPU Architecture Overview Scientific Computing on GPUs: GPU Architecture Overview Dominik Göddeke, Jakub Kurzak, Jan-Philipp Weiß, André Heidekrüger and Tim Schröder PPAM 2011 Tutorial Toruń, Poland, September 11 http://gpgpu.org/ppam11

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

Parallel Programming. Libraries and Implementations

Parallel Programming. Libraries and Implementations Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

ECE 574 Cluster Computing Lecture 17

ECE 574 Cluster Computing Lecture 17 ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux

More information

COSC 6385 Computer Architecture. - Data Level Parallelism (II)

COSC 6385 Computer Architecture. - Data Level Parallelism (II) COSC 6385 Computer Architecture - Data Level Parallelism (II) Fall 2013 SIMD Instructions Originally developed for Multimedia applications Same operation executed for multiple data items Uses a fixed length

More information

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi

More information

Parallel Accelerators

Parallel Accelerators Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon

More information

Intel C++ Compiler Professional Edition 11.1 for Mac OS* X. In-Depth

Intel C++ Compiler Professional Edition 11.1 for Mac OS* X. In-Depth Intel C++ Compiler Professional Edition 11.1 for Mac OS* X In-Depth Contents Intel C++ Compiler Professional Edition 11.1 for Mac OS* X. 3 Intel C++ Compiler Professional Edition 11.1 Components:...3 Features...3

More information

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been

More information

Performance of deal.ii on a node

Performance of deal.ii on a node Performance of deal.ii on a node Bruno Turcksin Texas A&M University, Dept. of Mathematics Bruno Turcksin Deal.II on a node 1/37 Outline 1 Introduction 2 Architecture 3 Paralution 4 Other Libraries 5 Conclusions

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information