OpenMP API Version 5.0


OpenMP API Version 5.0 (or: Pretty Cool & New OpenMP Stuff). Michael Klemm, Chief Executive Officer, OpenMP Architecture Review Board. michael.klemm@openmp.org

Architecture Review Board. The mission of the OpenMP ARB (Architecture Review Board) is to standardize directive-based, multi-language, high-level parallelism that is performant, productive, and portable.

Membership Structure. ARB Member: the highest membership category, with participation in technical discussions and organizational decisions, voting rights on organizational topics, and voting rights on technical topics (tickets, TRs, specifications). ARB Advisor & ARB Contributor: contribute to technical discussions and have voting rights on technical topics (tickets, TRs, specifications). Your organization can join and influence the direction of OpenMP. Talk to me or send email to michael.klemm@openmp.org.

OpenMP Roadmap. OpenMP has a well-defined roadmap: a 5-year cadence for major releases, one minor release in between, and (at least) one Technical Report (TR) with feature previews every year.

(Timeline: TR6, Nov 17; OpenMP 5.0, Nov 18, preceded by the Public Comment Draft (TR7); TR8*, Nov 19; OpenMP 5.x, Nov 20; TR9*, Nov 21; TR10*, Nov 22; OpenMP 6.0, Nov 23.)

* Numbers assigned to TRs may change if additional TRs are released.

Levels of Parallelism in OpenMP 4.5:
Cluster: group of computers communicating through a fast interconnect.
Coprocessors/Accelerators: special compute devices attached to the local node through a special interconnect.
Node: group of cache-coherent processors communicating through shared memory/cache.
Core: group of functional units within a die communicating through registers.
Hyper-Threads: group of thread contexts sharing functional units.
Superscalar: group of instructions sharing functional units.
Pipeline: sequence of instructions sharing functional units.
Vector: single instruction using multiple functional units.
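
As a rough illustration (not from the slides), a single OpenMP 4.5 construct can address several of these levels at once. The sketch below assumes a SAXPY-style kernel with illustrative names (n, a, x, y):

// Hypothetical sketch: one combined construct touching several levels.
// target             -> offload to a coprocessor/accelerator
// teams distribute   -> spread the iteration space over groups of cores
// parallel for       -> threads within each team
// simd               -> vector units within each core
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma omp target teams distribute parallel for simd \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}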

OpenMP Version 5.0. OpenMP 5.0 will introduce powerful new features to improve programmability: task reductions, C++14 and C++17 support, the loop construct, multi-level parallelism, parallel scan, meta-directives, memory allocators, task-to-data affinity, dependence objects, Fortran 2008 support, improved task dependences, detachable tasks, unified shared memory, collapsing of non-rectangular loops, data serialization for offload, function variants, tools APIs, and reverse offloading.

Task-to-data Affinity. The OpenMP API version 5.0 supports affinity hints for OpenMP tasks. Without hints, the tasks below may be scheduled on cores far away from the data they work on:

void task_affinity()
{
   double* B;
   #pragma omp task shared(B)
   {
      B = init_B_and_important_computation(A);
   }
   #pragma omp task firstprivate(B)
   {
      important_computation_too(B);
   }
   #pragma omp taskwait
}

(Diagram: two sockets, each with cores, on-chip caches, and local memory, connected by an interconnect; A[0..N] resides in one socket's memory and B[0..N] in the other's.)

Task-to-data Affinity. With the new affinity clause, the programmer hints at the data each task will access, so the runtime can schedule the tasks close to that data:

void task_affinity()
{
   double* B;
   #pragma omp task shared(B) affinity(A[0:N])
   {
      B = init_B_and_important_computation(A);
   }
   #pragma omp task firstprivate(B) affinity(B[0:N])
   {
      important_computation_too(B);
   }
   #pragma omp taskwait
}

(Diagram: the same two-socket system; with the hints, the first task runs near A[0..N] and the second near B[0..N].)

Task Reductions. Task reductions extend traditional reductions to arbitrary task graphs. They extend the existing task and taskgroup constructs and also work with the taskloop construct (a sketch of the taskloop case follows below).

int res = 0;
node_t* node = NULL;
...
#pragma omp parallel
#pragma omp single
#pragma omp taskgroup task_reduction(+: res)
{
   while (node) {
      #pragma omp task in_reduction(+: res) firstprivate(node)
      {
         res += node->value;
      }
      node = node->next;
   }
}
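
For the taskloop case, a minimal sketch (not from the slides; sum_array, a, and n are illustrative names) might look like this:

// Hypothetical sketch: reducing over the iterations of a taskloop (OpenMP 5.0).
double sum_array(const double *a, int n)
{
    double sum = 0.0;
    #pragma omp parallel
    #pragma omp single
    #pragma omp taskloop reduction(+: sum)
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}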

New Task Dependencies. OpenMP 5.0 adds the mutexinoutset dependence type. In the example below, replacing depend(inout: res) with depend(mutexinoutset: res) on tasks T3 and T4 allows them to execute in either order (instead of strictly one after the other), while still never running at the same time, since both update res.

int x = 0, y = 0, res = 0;
#pragma omp parallel
#pragma omp single
{
   #pragma omp task depend(out: res)                          // T0
      res = 0;
   #pragma omp task depend(out: x)                            // T1
      long_computation(x);
   #pragma omp task depend(out: y)                            // T2
      short_computation(y);
   #pragma omp task depend(in: x) depend(mutexinoutset: res)  // T3
      res += x;
   #pragma omp task depend(in: y) depend(mutexinoutset: res)  // T4
      res += y;
   #pragma omp task depend(in: res)                           // T5
      std::cout << res << std::endl;
}

(Task graph: T1 and T2 feed T3 and T4 respectively, T0 feeds both T3 and T4, and T5 depends on T3 and T4.)

Asynchronous API Interaction. Some APIs are based on asynchronous operations: MPI asynchronous send and receive, asynchronous I/O, CUDA and OpenCL stream-based offloading, and in general any other API/model that executes asynchronously with OpenMP (tasks). Example: CUDA memory transfers.

do_something();
cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream);
do_something_else();
cudaStreamSynchronize(stream);
do_other_important_stuff(dst);

Programmers need a mechanism to marry such asynchronous APIs with the parallel task model of OpenMP: how can completion events be synchronized with task execution?

Use OpenMP Tasks. A first attempt is to wrap the asynchronous pieces in tasks:

void cuda_example()
{
   #pragma omp task                       // task A
   {
      do_something();
      cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream);
   }
   #pragma omp task                       // task B
      do_something_else();
   #pragma omp task                       // task C
   {
      cudaStreamSynchronize(stream);
      do_other_important_stuff(dst);
   }
}

There is a race condition between tasks A and C: task C may start executing before task A has enqueued the memory transfer.

Detaching Tasks. The new detach clause delays the completion of a task until an associated event is fulfilled:

omp_event_t *event;
void detach_example()
{
   #pragma omp task detach(event)
   {
      important_code();
   }
   #pragma omp taskwait
}

1. The task detaches after executing its body.
2. The taskwait construct cannot complete yet.
3. Some other thread or task signals the event: omp_fulfill_event(event);
4. The detached task completes and the taskwait can continue.
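
As an end-to-end illustration (not from the slides), the pieces can be combined in a single minimal example. It uses the draft API names shown on the slide (omp_event_t, omp_fulfill_event); in the final OpenMP 5.0 specification the event type is named omp_event_handle_t.

#include <omp.h>
#include <stdio.h>

omp_event_t *event;   // set by the detach clause when task A is created (draft naming)

void detach_example_full(void)
{
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task detach(event)     // task A: its body finishes, but the task
        {                                  // stays incomplete until the event is fulfilled
            printf("task A body done\n");
        }
        #pragma omp task                   // task B: fulfills the event
        {
            omp_fulfill_event(event);
        }
        #pragma omp taskwait               // returns once A and B have completed
    }
}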

Detachable Tasks and Asynchronous APIs.

void CUDART_CB callback(cudaStream_t stream, cudaError_t status, void *cb_data)
{
   omp_fulfill_event((omp_event_t *) cb_data);
}

void cuda_example()
{
   omp_event_t *cuda_event;
   #pragma omp task detach(cuda_event)                 // task A
   {
      do_something();
      cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream);
      cudaStreamAddCallback(stream, callback, cuda_event, 0);
   }
   #pragma omp task                                    // task B
      do_something_else();
   #pragma omp taskwait
   #pragma omp task                                    // task C
   {
      do_other_important_stuff(dst);
   }
}

1. Task A detaches.
2. The taskwait does not continue.
3. When the memory transfer completes, the callback is invoked to signal the event for task completion.
4. The taskwait continues and task C executes.

Detachable Tasks and Asynchronous APIs. The same pattern expressed with task dependences instead of a taskwait:

void CUDART_CB callback(cudaStream_t stream, cudaError_t status, void *cb_data)
{
   omp_fulfill_event((omp_event_t *) cb_data);
}

void cuda_example()
{
   omp_event_t *cuda_event;
   #pragma omp task depend(out: dst) detach(cuda_event)   // task A
   {
      do_something();
      cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream);
      cudaStreamAddCallback(stream, callback, cuda_event, 0);
   }
   #pragma omp task                                        // task B
      do_something_else();
   #pragma omp task depend(in: dst)                        // task C
   {
      do_other_important_stuff(dst);
   }
}

1. Task A detaches; task C will not execute because of its unfulfilled dependence on A.
2. The memory transfer completes and the callback is invoked to signal the event for task completion.
3. Task A completes and C's dependence is fulfilled.

Memory Allocators. OpenMP 5.0 adds a new clause on all constructs with data-sharing clauses: allocate([allocator:] list). Allocation: omp_alloc(size_t size, omp_allocator_t *allocator). Deallocation: omp_free(void *ptr, const omp_allocator_t *allocator); the allocator argument is optional. There is also an allocate directive, a standalone directive that applies to variable declarations or to an allocation statement.

Example: Using Memory Allocators.

void allocator_example(omp_allocator_t *my_allocator)
{
   int a[m], b[n], c;
   #pragma omp allocate(a) allocator(omp_high_bw_mem_alloc)
   #pragma omp allocate(b)   // controlled by OMP_ALLOCATOR and/or omp_set_default_allocator
   double *p = (double *) omp_alloc(n*m*sizeof(*p), my_allocator);   // instead of malloc(n*m*sizeof(*p))

   #pragma omp parallel private(a) allocate(my_allocator: a)
   {
      some_parallel_code();
   }

   #pragma omp target firstprivate(c) allocate(omp_const_mem_alloc: c)   // on the target; must be a compile-time expr
   {
      #pragma omp parallel private(a) allocate(omp_high_bw_mem_alloc: a)
      {
         some_other_parallel_code();
      }
   }

   omp_free(p);
}

Partitioning Memory.

void allocator_example()
{
   double *array;
   omp_allocator_t *allocator;
   omp_alloctrait_t traits[] = { {OMP_ATK_PARTITION, OMP_ATV_BLOCKED} };
   int ntraits = sizeof(traits) / sizeof(*traits);
   allocator = omp_init_allocator(omp_default_mem_space, ntraits, traits);

   array = omp_alloc(sizeof(*array) * N, allocator);

   #pragma omp parallel for proc_bind(spread)
   for (int i = 0; i < N; ++i)
      important_computation(&array[i]);

   omp_free(array);
}

(Diagram: a two-socket system; with the blocked partition trait, the first half of the array is placed in the first socket's memory and the second half in the other's, matching the spread thread placement.)

loop Construct. Existing loop constructs are tightly bound to the execution model:

#pragma omp parallel for      // fork, distribute work, barrier, join
for (i = 0; i < n; ++i) { ... }

#pragma omp simd              // vectorize
for (i = 0; i < n; ++i) { ... }

#pragma omp taskloop          // generate tasks, taskwait
for (i = 0; i < n; ++i) { ... }

The loop construct is meant to let the OpenMP implementation pick the right parallelization scheme for a parallel loop.
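
On the host, a minimal sketch of the new construct (not from the slides; scale, n, s, and v are illustrative names) could look like this:

// Hypothetical sketch: the implementation decides how to run the loop in parallel.
void scale(int n, double s, double *v)
{
    #pragma omp parallel loop
    for (int i = 0; i < n; ++i)
        v[i] *= s;
}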

OpenMP 4.5 Multi-level Device Parallelism.

int main(int argc, const char* argv[])
{
   float *x = (float*) malloc(n * sizeof(float));
   float *y = (float*) malloc(n * sizeof(float));
   // Define scalars n, a, b & initialize x, y

   #pragma omp target map(to: x[0:n]) map(tofrom: y)
   {
      #pragma omp teams distribute parallel for \
              num_teams(num_blocks) num_threads(bsize)
      for (int i = 0; i < n; ++i) {
         y[i] = a*x[i] + y[i];
      }
   }
}

Simplifying Multi-level Device Parallelism. With the loop construct, the explicit teams distribute parallel for nesting and the num_teams/num_threads tuning clauses are no longer needed; the implementation chooses how to map the loop to the device:

int main(int argc, const char* argv[])
{
   float *x = (float*) malloc(n * sizeof(float));
   float *y = (float*) malloc(n * sizeof(float));
   // Define scalars n, a, b & initialize x, y

   #pragma omp target map(to: x[0:n]) map(tofrom: y)
   {
      #pragma omp teams loop
      for (int i = 0; i < n; ++i) {
         y[i] = a*x[i] + y[i];
      }
   }
}

Device Profiles: Unified Memory Architecture. Heterogeneous programming normally requires map clauses to transfer (ownership of) data to target devices. They are not needed for unified-memory devices (except for optimization):

typedef struct mypoints {
   struct myvec *x;
   struct myvec scratch;
   double useless_data[500000];
} mypoints_t;

#pragma omp requires unified_shared_memory

mypoints_t p = new_mypoints_t();
#pragma omp target   // no map clauses needed
do_something_with_p(&p);

The metadirective Directive. A metadirective constructs OpenMP directives for different OpenMP contexts, a limited form of meta-programming for OpenMP directives and clauses.

#pragma omp target map(to: v1, v2) map(from: v3)
#pragma omp metadirective \
        when( device={arch(nvptx)}: teams loop ) \
        default( parallel loop )
for (i = lb; i < ub; i++)
   v3[i] = v1[i] * v2[i];

!$omp begin metadirective &
!$omp   when( implementation={unified_shared_memory}: target ) &
!$omp   default( target map(mapper(vec_map), tofrom: vec) )
!$omp teams distribute simd
do i = 1, vec%size()
   call vec(i)%work()
end do
!$omp end teams distribute simd
!$omp end metadirective

Declare Variant. OpenMP 4.5 can already create multiple versions of functions:

#pragma omp declare target
#pragma omp declare simd simdlen(4)
int important_stuff(int x)
{
   // code of the function important_stuff
}

declare variant injects user-defined versions of functions:

#pragma omp declare variant( int important_stuff(int x) ) \
        match( context={target, simd}, device={arch(nvptx)} )
int important_stuff_nvidia(int x)
{
   /* Specialized code for NVIDIA targets */
}

#pragma omp declare variant( int foo(int x) ) \
        match( context={target, simd(simdlen(4))}, device={isa(avx2)} )
__m256i _mm256_epi32_important_stuff(__m256i x);
/* Specialized code for a simd loop called on an AVX2 processor */

Declare Variant: Execution Context. The context describes the lexical scope of an OpenMP construct and its lexical nesting in other OpenMP constructs:

// context = {}
#pragma omp target teams
{
   // context = {target, teams}
   #pragma omp parallel
   {
      // context = {target, teams, parallel}
      #pragma omp simd aligned(a:64)
      for (...) {
         // context = {target, teams, parallel,
         //            simd(aligned(a:64), simdlen(8), notinbranch)}
         foo(a);
      }
   }
}

OpenMP Tools Interfaces. (Diagram contrasting the developer's view of an OpenMP program with the implementation's view inside the runtime.)

OpenMP Tools Interface (OMPT). OMPT provides interfaces for first-party tools (running in the same process) to attach to the OpenMP implementation for performance analysis and correctness checking: (1) the OpenMP runtime library (RTL) invokes the tool's ompt_start_tool and ompt_initialize entry points, (2) the tool registers callbacks, (3) the runtime invokes those callbacks on OpenMP events in the application's address space and threads, and (4) the tool can query further runtime information. Examples of OMPT event callbacks: ompt_callback_parallel_begin, ompt_callback_parallel_end, ompt_callback_implicit_task, ompt_callback_task_create, ompt_callback_dependences, ...
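
A minimal first-party tool skeleton (not from the slides; based on the OMPT interface as published in OpenMP 5.0, so details may differ from the TR draft shown in the talk) could look like this:

#include <omp-tools.h>
#include <stdio.h>

// Invoked by the runtime whenever a parallel region begins.
static void on_parallel_begin(ompt_data_t *encountering_task_data,
                              const ompt_frame_t *encountering_task_frame,
                              ompt_data_t *parallel_data,
                              unsigned int requested_parallelism,
                              int flags, const void *codeptr_ra)
{
    printf("parallel region begins (%u threads requested)\n", requested_parallelism);
}

// Step 2: register callbacks once the runtime hands over the lookup function.
static int my_initialize(ompt_function_lookup_t lookup, int initial_device_num,
                         ompt_data_t *tool_data)
{
    ompt_set_callback_t set_callback =
        (ompt_set_callback_t) lookup("ompt_set_callback");
    set_callback(ompt_callback_parallel_begin, (ompt_callback_t) on_parallel_begin);
    return 1;   // non-zero keeps the tool active
}

static void my_finalize(ompt_data_t *tool_data) { }

// Step 1: the runtime looks up this symbol during initialization.
ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                          const char *runtime_version)
{
    static ompt_start_tool_result_t result = { &my_initialize, &my_finalize, {0} };
    return &result;
}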

OpenMP Debugging Interface (OMPD). OMPD provides interfaces for third-party tools, such as a debugger in a separate process, to inspect the current state of OpenMP execution. The OMPD DLL is loaded into the debugger and callbacks are registered; the debugger requests OpenMP state through the OMPD DLL, which in turn uses the registered callbacks to request symbol addresses and read memory from the OpenMP program (runtime library and application threads) and then returns the result to the debugger.

OpenMP Debugging Interface (OMPD): example. A segmentation fault occurs in thread 100; is this an OpenMP thread? The debugger obtains the address space, loads the OMPD DLL, initializes OMPD, and initializes the process. If the thread resides in a device, the device is initialized as well and the debugger obtains the device address space. The debugger then requests a thread handle for the thread identifier: if the request returns OK, thread 100 is an OpenMP thread and the debugger obtains the thread handle; otherwise it is not an OpenMP thread.

The Last Slide. OpenMP 5.0 will be a major leap forward: it is to be released at SC18, with well-defined interfaces for tools and new ways to express parallelism. OpenMP is a modern directive-based programming model: multi-level parallelism is supported (coprocessors, threads, SIMD), the task-based programming model is the modern approach to parallelism, powerful language features are available for complex algorithms, and it provides high-level access to parallelism with a path forward to highly efficient programming.

Visit www.openmp.org for more information