Introduction to Runtime Systems
Introduction to Runtime Systems: Towards Portability of Performance
STORM (Static Optimizations, Runtime Methods)
Olivier Aumage, Team Storm, Inria / LaBRI, in cooperation with La Maison de la Simulation
Contents
1. Introduction
2. Computing Hardware
3. Parallel Programming Models
4. Computing Runtime Systems
1. Introduction
Hardware Evolution
More capabilities, more complexity:
- Graphics: higher resolutions, 2D acceleration, 3D rendering
- Networking: processing offload, zero-copy transfers, hardware multiplexing
- I/O: RAID, SSDs vs. disks, network-attached disks, parallel file systems
- Computing: multiprocessors, multicores, vector processing extensions, accelerators
Dilemma for the Application Programmer
- Stay conservative? Only use standards and long-established features: sequential programming, common Unix system calls, TCP sockets. But: under-used hardware, low performance.
- Use tempting bleeding-edge features? Efficiency and convenience, but what about portability, adaptiveness, cost, long-term viability, vendor-tied code?
Answer: use runtime systems!
The Role(s) of Runtime Systems
- Portability: abstraction; drivers, plugins
- Control: resource mapping, scheduling
- Adaptiveness: load balancing; monitoring, sampling, calibrating
- Optimization: request aggregation, resource locality, computation offload, computation/transfer overlap
Examples of Runtime Systems
- Networking: MPI (Message Passing Interface), Global Arrays, CCI (Common Communication Interface), Distributed Shared Memory systems
- Graphics: DirectX/Direct3D (Microsoft Windows), OpenGL
- I/O: MPI-IO, database engines (Google LevelDB)
- Computing runtime systems?...
2. Computing Hardware
Evolution of Computing Hardware
Rupture: the frequency wall
- Processing units cannot run any faster
- Looking for other sources of performance
Hardware parallelism
- Multiply existing processing power
- Have several processing units work together
- Not a new idea, but now becoming the key performance factor
Processor Parallelisms
Various forms of hardware parallelism:
- Multiprocessors
- Multicores
- Hardware multithreading (SMT)
- Vector processing (SIMD)
Multiple forms may be combined.
Multiprocessors and Multicores
Multiprocessors: full processor replicates
- Rationale: share node contents, i.e. memory and devices
- Memory sharing may involve non-uniformity (see the upcoming hwloc and TreeMatch talks!)
Multicores: processor circuit replicates (cores) printed on the same die
- Rationale: use the die area made available by the shrinking process for more processing power
- Share memory and devices; may also share some additional die circuitry (cache(s), uncore services)
Taking advantage of them? Needs multiple parallel application activities.
Additional considerations: availability, work mapping issues, locality issues, memory bandwidth issues.
Hardware Multithreading
Simultaneous Multithreading (SMT)
- Multiple processing contexts managed by the same core
- Enables interleaving multiple threads on the same core
- Rationale: try to fill more computing units (e.g. integer + floating point), hide memory/cache latency
Taking advantage of it?
- Needs multiple parallel application activities
- Highly dependent on the characteristics of those activities: complementary vs. competitive
Additional considerations: availability, work mapping issues, locality issues, memory bandwidth issues, benefit vs. loss.
Vector Processing
Single Instruction, Multiple Data (SIMD)
- Apply an instruction to multiple data elements simultaneously
- Enables repeating simple operations on array elements
- Rationale: share instruction decoding between several data elements
Taking advantage of it? Specially written kernels:
- generated by the compiler,
- written in assembly language,
- or written with intrinsics.
Additional considerations: availability, feature-set variants (MMX, 3DNow!, SSE [2...5], AVX, ...), benefit vs. loss.
Accelerators
- Special-purpose computing devices (or general-purpose GPUs)
- (Initially) a discrete expansion card
- Rationale: die area trade-off
Single Instruction, Multiple Threads (SIMT)
- A single control unit drives several computing units: a streaming multiprocessor with its scalar cores (streaming processors)
- SIMT is distinct from SIMD: control flows are allowed to diverge (e.g. on an if/else)... but better avoid it!
[Figure: a GPU made of streaming multiprocessors, each pairing a control unit with scalar cores, plus DRAM]
GPU Hardware Model: CPU vs. GPU
Multiple strategies for multiple purposes.
CPU strategy: large caches, large control logic
- Purpose: complex codes, branching, complex memory access patterns
- Analogy: a World Rally Championship car
GPU strategy: lots of computing power, simplified control
- Purpose: regular data-parallel codes, simple memory access patterns
- Analogy: a Formula One car
[Figure: CPU (control + cache + a few ALUs + DRAM) vs. GPU (many small ALUs + DRAM)]
GPU Software Model (SIMT)
- Kernels are enclosed in an implicit loop over an iteration space
- One kernel instance is run for each point of the space
- Threads execute the work simultaneously
- Specific languages: NVIDIA CUDA, OpenCL
Hardware abstraction: each scalar core executes instances of the kernel; the thread executing a given instance is identified by the threadIdx variable.

The slides' CUDA example, with the conceptually equivalent sequential loop shown in comments:

```c
__global__ void
vecadd(float *A, float *B, float *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int
main() {
    /* ... */
    vecadd<<<1, NB>>>(A, B, C);
    /* Conceptually equivalent to:
     *   for (threadIdx.x = 0; threadIdx.x < NB; threadIdx.x++)
     *       vecadd(A, B, C);
     * with each instance seeing its own value of threadIdx.x:
     *   { int i = 0; C[i] = A[i]+B[i]; }  { int i = 1; ... }  ...
     */
    /* ... */
}
```
Manycores
Intel SCC
- 48 cores (P54C Pentium)
- No cache coherence
- Communication library
Intel Xeon Phi / MIC
- 61 cores (P54C Pentium)
- 4 hardware threads per core
- 512-bit SIMD instruction set
- Cache coherence
Classical programming tool-chain (compilers, libraries)... but no free lunch: kernels and applications need optimization work.
Discrete accelerator cards (for now!): transfer data to card memory, transfer results back to main memory.
3. Parallel Programming Models
Parallel Programming Models
Languages:
- Directive-based languages
- Specialized languages
- PGAS languages
- ...
Libraries:
- Linear algebra
- FFT
- ...
Directive-Based Languages: Cilk
Programming environment
- A language and compiler: keyword-based extension of C
- An execution model and a runtime system
- Recursive parallelism: divide-and-conquer model
History
- Initially developed at the MIT Supertech Research Group (Charles E. Leiserson's team), mid-90s
- Now developed by Intel; available in ICC and GNU GCC, experimental version in LLVM/Clang

The slides' Fibonacci example, first sequential, then parallelized with the Cilk keywords:

```c
/* Sequential version */
int fibo(int n) {
    int r;
    if (n < 2)
        r = n;
    else {
        int x, y;
        x = fibo(n - 1);
        y = fibo(n - 2);
        r = x + y;
    }
    return r;
}
```

```c
/* Cilk version: spawn forks a child task, sync waits for all children */
cilk int fibo(int n) {
    int r;
    if (n < 2)
        r = n;
    else {
        int x, y;
        x = spawn fibo(n - 1);
        y = spawn fibo(n - 2);
        sync;
        r = x + y;
    }
    return r;
}
```
Directive-Based Languages: OpenMP
Iterative parallelism
- Parallel section executed by a team of threads

The slides' example: the loop iterations are split across the team with #pragma omp for.

```c
int i;

#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}
```
Task parallelism, recursive parallelism
- Tasks: OpenMP 3.0
- Task dependencies, accelerators: OpenMP 4.0

The slides' example: a single thread walks the list and creates one task per element.

```c
list *ptr = list_head;

#pragma omp parallel
{
    #pragma omp single
    while (ptr != NULL) {
        void *data = ptr->data;

        #pragma omp task firstprivate(data)
        {
            process(data);
        }

        ptr = ptr->next;
    }

    #pragma omp taskwait
}
```
PGAS Languages: UPC
Partitioned Global Address Space; Unified Parallel C
- Global shared data, data distribution
- Parallel loops, threads
- Task extensions: UPC Task Library

The slides' matrix-vector product, first in plain C, then distributed with shared data and upc_forall:

```c
/* Sequential version */
int a[THREADS][THREADS];
int b[THREADS];
int c[THREADS];
int i, j;

for (i = 0; i < THREADS; i++) {
    c[i] = 0;
    for (j = 0; j < THREADS; j++) {
        c[i] += a[i][j] * b[j];
    }
}
```

```c
/* UPC version: rows of a are distributed, each thread computes
 * the iterations whose affinity expression (the final 'i') it owns */
#include <upc_relaxed.h>

shared [THREADS] int a[THREADS][THREADS];
shared int b[THREADS];
shared int c[THREADS];
int i, j;

upc_forall (i = 0; i < THREADS; i++; i) {
    c[i] = 0;
    for (j = 0; j < THREADS; j++) {
        c[i] += a[i][j] * b[j];
    }
}
```
Libraries
Specialized libraries: black-box parallelism
- Linear algebra: BLAS, LAPACK; Intel MKL, MAGMA, PLASMA
- Signal processing: FFTW, Spiral
- ...
Common Denominator
Many similar fundamental services:
- A lower-level layer
- An abstraction/optimization layer: the computing runtime system
Its job: mapping work onto computing resources, resolving trade-offs, optimizing, scheduling.
4. Computing Runtime Systems
Computing Runtime Systems
Two classes:
- Thread scheduling
- Task scheduling
Thread Scheduling
Thread: an unbounded parallel activity, with one state/context per thread.
Variants: cooperative multithreading, preemptive multithreading.
Examples: nowadays, libpthread.
Discussion: flexibility; but resource consumption? adaptiveness? synchronization?
Task Scheduling
Task: an elementary computation, a piece of potential parallel work.
- No dedicated state; executed by an internal set of worker threads
Variants:
- Recursive tasks vs. non-blocking tasks
- Dependency management
Examples: StarPU; Cilk's runtime; Intel Threading Building Blocks (TBB); StarSs/OmpSs; PaRSEC; ...
Discussion: abstraction, adaptiveness, transparent synchronization using dependencies.
Heterogeneous Task Scheduling
Scheduling on platforms equipped with accelerators.
Adapting to heterogeneity:
- Decide which tasks to offload
- Decide which tasks to keep on the CPU
Communicating with discrete accelerator board(s) is expensive:
- Send computation requests
- Send the data to be processed
- Fetch results back
Hence: decide whether offloading is worth it. See the StarPU talk.
Computing Runtimes Ecosystem: Scheduling and Memory Management
Data transfers: CPU <-> discrete accelerator.
- Minimize transfers
- Overlap transfers and requests with computation
- Cooperation with a Distributed Shared Memory system
Computing Runtimes Ecosystem: Scheduling and Networking
Distributed computing: interoperability, minimization, overlap.
Cooperation with a network library (MPI, Global Arrays, etc.):
- Anticipate communication needs
- Merge multiple requests
- Throttle/alter scheduling according to network events
Computing Runtimes Ecosystem: Scheduling and I/O
Out-of-core: very large computations; temporarily storing large data structures on disk.
Interoperability, minimization, overlap.
Cooperation with an I/O library:
- When to store some data on disk?
- When to fetch it back?
- Heuristics
Computing Runtimes Ecosystem: Scheduling and Scheduling Theory
Algorithmics:
- Designing scheduling algorithms
- Testing scheduling algorithms in real life
Computing runtimes as an interface framework:
- Plug in new algorithms
- Keep the same interface
- Transparent for the application
Conclusion
Runtimes as interface frameworks:
- Portability
- Control
- Adaptiveness
- Optimization
=> Portability of performance.
Program of the Training Session
Thursday, June 04:
- 09:00 (09:30) - 10:00: Introduction to Runtime Systems (Olivier Aumage)
- coffee break
- 10:15 - 12:00: The StarPU computing runtime, Part I (Olivier Aumage, Nathalie Furmento, Samuel Thibault)
- lunch break
- 14:00 - 16:00: The EZTrace framework for performance debugging, Part I (Matias Hastaran, François Rué)
Friday, June 05:
- 09:00 - 11:00: The hardware locality library (hwloc) (Brice Goglin)
- coffee break
- 11:15 - 12:45: A process placement framework for multicore clusters: TreeMatch (Emmanuel Jeannot)
- lunch break
- 14:00 - 16:00: The StarPU computing runtime, Part II
Towards a codelet-based runtime for exascale computing Chris Lauderdale ET International, Inc. What will be covered Slide 2 of 24 Problems & motivation Codelet runtime overview Codelets & complexes Dealing
More informationOpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer
OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance
More informationSIMD Exploitation in (JIT) Compilers
SIMD Exploitation in (JIT) Compilers Hiroshi Inoue, IBM Research - Tokyo 1 What s SIMD? Single Instruction Multiple Data Same operations applied for multiple elements in a vector register input 1 A0 input
More informationWhy you should care about hardware locality and how.
Why you should care about hardware locality and how. Brice Goglin TADaaM team Inria Bordeaux Sud-Ouest Agenda Quick example as an introduction Bind your processes What's the actual problem? Convenient
More informationThe Heterogeneous Programming Jungle. Service d Expérimentation et de développement Centre Inria Bordeaux Sud-Ouest
The Heterogeneous Programming Jungle Service d Expérimentation et de développement Centre Inria Bordeaux Sud-Ouest June 19, 2012 Outline 1. Introduction 2. Heterogeneous System Zoo 3. Similarities 4. Programming
More informationGPU GPU CPU. Raymond Namyst 3 Samuel Thibault 3 Olivier Aumage 3
/CPU,a),2,2 2,2 Raymond Namyst 3 Samuel Thibault 3 Olivier Aumage 3 XMP XMP-dev CPU XMP-dev/StarPU XMP-dev XMP CPU StarPU CPU /CPU XMP-dev/StarPU N /CPU CPU. Graphics Processing Unit GP General-Purpose
More informationCS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher
CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming Lecturer: Alan Christopher Overview GP-GPU: What and why OpenCL, CUDA, and programming GPUs GPU Performance
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationGPU programming. Dr. Bernhard Kainz
GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling
More informationCOMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES
COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES P(ND) 2-2 2014 Guillaume Colin de Verdière OCTOBER 14TH, 2014 P(ND)^2-2 PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France October 14th, 2014 Abstract:
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationCS420: Operating Systems
Threads James Moscola Department of Physical Sciences York College of Pennsylvania Based on Operating System Concepts, 9th Edition by Silberschatz, Galvin, Gagne Threads A thread is a basic unit of processing
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationThe Era of Heterogeneous Computing
The Era of Heterogeneous Computing EU-US Summer School on High Performance Computing New York, NY, USA June 28, 2013 Lars Koesterke: Research Staff @ TACC Nomenclature Architecture Model -------------------------------------------------------
More informationG P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G
Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationHPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber,
HPC trends (Myths about) accelerator cards & more June 24, 2015 - Martin Schreiber, M.Schreiber@exeter.ac.uk Outline HPC & current architectures Performance: Programming models: OpenCL & OpenMP Some applications:
More informationParallel Programming Libraries and implementations
Parallel Programming Libraries and implementations Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.
More informationPreparing seismic codes for GPUs and other
Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationScheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok
Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Texas Learning and Computation Center Department of Computer Science University of Houston Outline Motivation
More informationIntroduction II. Overview
Introduction II Overview Today we will introduce multicore hardware (we will introduce many-core hardware prior to learning OpenCL) We will also consider the relationship between computer hardware and
More information! Readings! ! Room-level, on-chip! vs.!
1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads
More informationShared-memory Parallel Programming with Cilk Plus
Shared-memory Parallel Programming with Cilk Plus John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 4 30 August 2018 Outline for Today Threaded programming
More informationOpenMP 4.0. Mark Bull, EPCC
OpenMP 4.0 Mark Bull, EPCC OpenMP 4.0 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all devices!
More informationMulti-Processors and GPU
Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock
More informationOpenMP 4.0/4.5. Mark Bull, EPCC
OpenMP 4.0/4.5 Mark Bull, EPCC OpenMP 4.0/4.5 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationProgramming Models for Multi- Threading. Brian Marshall, Advanced Research Computing
Programming Models for Multi- Threading Brian Marshall, Advanced Research Computing Why Do Parallel Computing? Limits of single CPU computing performance available memory I/O rates Parallel computing allows
More informationDesign of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017
Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationMaster Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.
Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Multithreading Contents Main features of explicit multithreading
More informationSome features of modern CPUs. and how they help us
Some features of modern CPUs and how they help us RAM MUL core Wide operands RAM MUL core CP1: hardware can multiply 64-bit floating-point numbers Pipelining: can start the next independent operation before
More informationGeneral introduction: GPUs and the realm of parallel architectures
General introduction: GPUs and the realm of parallel architectures GPU Computing Training August 17-19 th 2015 Jan Lemeire (jan.lemeire@vub.ac.be) Graduated as Engineer in 1994 at VUB Worked for 4 years
More informationAutoTune Workshop. Michael Gerndt Technische Universität München
AutoTune Workshop Michael Gerndt Technische Universität München AutoTune Project Automatic Online Tuning of HPC Applications High PERFORMANCE Computing HPC application developers Compute centers: Energy
More informationParallel Systems. Project topics
Parallel Systems Project topics 2016-2017 1. Scheduling Scheduling is a common problem which however is NP-complete, so that we are never sure about the optimality of the solution. Parallelisation is a
More informationA low memory footprint OpenCL simulation of short-range particle interactions
A low memory footprint OpenCL simulation of short-range particle interactions Raymond Namyst STORM INRIA Group With Samuel Pitoiset, Inria and Emmanuel Cieren, Laurent Colombet, Laurent Soulard, CEA/DAM/DPTA
More informationIntroduction to CUDA Programming
Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationArchitecture, Programming and Performance of MIC Phi Coprocessor
Architecture, Programming and Performance of MIC Phi Coprocessor JanuszKowalik, Piotr Arłukowicz Professor (ret), The Boeing Company, Washington, USA Assistant professor, Faculty of Mathematics, Physics
More informationParallel Accelerators
Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon
More informationTutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers
Tutorial Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld dan/lars/bbarth/milfeld@tacc.utexas.edu XSEDE 12 July 16, 2012
More informationCS516 Programming Languages and Compilers II
CS516 Programming Languages and Compilers II Zheng Zhang Spring 2015 Jan 22 Overview and GPU Programming I Rutgers University CS516 Course Information Staff Instructor: zheng zhang (eddy.zhengzhang@cs.rutgers.edu)
More informationCPU-GPU Heterogeneous Computing
CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems
More informationAccelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include
3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI
More informationProgrammer's View of Execution Teminology Summary
CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 28: GP-GPU Programming GPUs Hardware specialized for graphics calculations Originally developed to facilitate the use of CAD programs
More informationThe Stampede is Coming: A New Petascale Resource for the Open Science Community
The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation
More informationIntel Software Development Products for High Performance Computing and Parallel Programming
Intel Software Development Products for High Performance Computing and Parallel Programming Multicore development tools with extensions to many-core Notices INFORMATION IN THIS DOCUMENT IS PROVIDED IN
More informationWarps and Reduction Algorithms
Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum
More informationLecture 5. Performance programming for stencil methods Vectorization Computing with GPUs
Lecture 5 Performance programming for stencil methods Vectorization Computing with GPUs Announcements Forge accounts: set up ssh public key, tcsh Turnin was enabled for Programming Lab #1: due at 9pm today,
More informationTrends in HPC (hardware complexity and software challenges)
Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18
More information! XKaapi : a runtime for highly parallel (OpenMP) application
XKaapi : a runtime for highly parallel (OpenMP) application Thierry Gautier thierry.gautier@inrialpes.fr MOAIS, INRIA, Grenoble C2S@Exa 10-11/07/2014 Agenda - context - objective - task model and execution
More informationGPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten
GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,
More informationMAGMA. Matrix Algebra on GPU and Multicore Architectures
MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationGPU ACCELERATED DATABASE MANAGEMENT SYSTEMS
CIS 601 - Graduate Seminar Presentation 1 GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS PRESENTED BY HARINATH AMASA CSU ID: 2697292 What we will talk about.. Current problems GPU What are GPU Databases GPU
More informationPorting a parallel rotor wake simulation to GPGPU accelerators using OpenACC
DLR.de Chart 1 Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC Melven Röhrig-Zöllner DLR, Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU)
More informationVector Processors and Graphics Processing Units (GPUs)
Vector Processors and Graphics Processing Units (GPUs) Many slides from: Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley TA Evaluations Please fill out your
More informationCOSC 6339 Accelerators in Big Data
COSC 6339 Accelerators in Big Data Edgar Gabriel Fall 2018 Motivation Programming models such as MapReduce and Spark provide a high-level view of parallelism not easy for all problems, e.g. recursive algorithms,
More informationParallel Programming on Ranger and Stampede
Parallel Programming on Ranger and Stampede Steve Lantz Senior Research Associate Cornell CAC Parallel Computing at TACC: Ranger to Stampede Transition December 11, 2012 What is Stampede? NSF-funded XSEDE
More informationCUDA. Matthew Joyner, Jeremy Williams
CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel
More informationTrends and Challenges in Multicore Programming
Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores
More informationGetting Started with Intel SDK for OpenCL Applications
Getting Started with Intel SDK for OpenCL Applications Webinar #1 in the Three-part OpenCL Webinar Series July 11, 2012 Register Now for All Webinars in the Series Welcome to Getting Started with Intel
More informationSolving Dense Linear Systems on Platforms with Multiple Hardware Accelerators
Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Francisco D. Igual Enrique S. Quintana-Ortí Gregorio Quintana-Ortí Universidad Jaime I de Castellón (Spain) Robert A. van de
More informationLecture 11: GPU programming
Lecture 11: GPU programming David Bindel 4 Oct 2011 Logistics Matrix multiply results are ready Summary on assignments page My version (and writeup) on CMS HW 2 due Thursday Still working on project 2!
More informationCenter for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop
Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion
More informationChallenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008
Michael Doggett Graphics Architecture Group April 2, 2008 Graphics Processing Unit Architecture CPUs vsgpus AMD s ATI RADEON 2900 Programming Brook+, CAL, ShaderAnalyzer Architecture Challenges Accelerated
More informationShared memory programming model OpenMP TMA4280 Introduction to Supercomputing
Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing NTNU, IMF February 16. 2018 1 Recap: Distributed memory programming model Parallelism with MPI. An MPI execution is started
More informationScientific Computing on GPUs: GPU Architecture Overview
Scientific Computing on GPUs: GPU Architecture Overview Dominik Göddeke, Jakub Kurzak, Jan-Philipp Weiß, André Heidekrüger and Tim Schröder PPAM 2011 Tutorial Toruń, Poland, September 11 http://gpgpu.org/ppam11
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
More informationParallel Programming. Libraries and Implementations
Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationECE 574 Cluster Computing Lecture 17
ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux
More informationCOSC 6385 Computer Architecture. - Data Level Parallelism (II)
COSC 6385 Computer Architecture - Data Level Parallelism (II) Fall 2013 SIMD Instructions Originally developed for Multimedia applications Same operation executed for multiple data items Uses a fixed length
More informationIntroduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines
Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi
More informationParallel Accelerators
Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon
More informationIntel C++ Compiler Professional Edition 11.1 for Mac OS* X. In-Depth
Intel C++ Compiler Professional Edition 11.1 for Mac OS* X In-Depth Contents Intel C++ Compiler Professional Edition 11.1 for Mac OS* X. 3 Intel C++ Compiler Professional Edition 11.1 Components:...3 Features...3
More informationComputing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany
Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been
More informationPerformance of deal.ii on a node
Performance of deal.ii on a node Bruno Turcksin Texas A&M University, Dept. of Mathematics Bruno Turcksin Deal.II on a node 1/37 Outline 1 Introduction 2 Architecture 3 Paralution 4 Other Libraries 5 Conclusions
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More information