Advanced OpenMP Features

Christian Terboven, Dirk Schmidl
IT Center, RWTH Aachen University, Member of the HPC Group
{terboven,schmidl}@itc.rwth-aachen.de

Vectorization

Vectorization

SIMD = Single Instruction, Multiple Data: special hardware instructions operate on multiple data points at once. The instructions work on vector registers, and the vector length is hardware dependent.

  double a[4], b[4], c[4];
  for (i = 0; i < 4; i++) {
      a[i] = b[i] + c[i];
  }

Sequential execution:
  Step 1: b[0] + c[0] = a[0]
  Step 2: b[1] + c[1] = a[1]
  Step 3: b[2] + c[2] = a[2]
  Step 4: b[3] + c[3] = a[3]

Vectorized execution (two elements per vector):
  Step 1: b[0],b[1] + c[0],c[1] = a[0],a[1]
  Step 2: b[2],b[3] + c[2],c[3] = a[2],a[3]

Vectorization

Vector lengths on Intel architectures:
  128 bit: SSE (Streaming SIMD Extensions)  - 2 x double, 4 x float
  256 bit: AVX (Advanced Vector Extensions) - 4 x double, 8 x float
  512 bit: AVX-512                          - 8 x double, 16 x float

Data Alignment

Vectorization works best on aligned data structures.

  Good alignment:     the array starts on a vector boundary (a[0] at address 0), so each vector load covers whole elements and never crosses a vector boundary.
  Bad alignment:      the array starts one element past the vector boundary (a[0] at address 8), so vector loads straddle vector boundaries.
  Very bad alignment: the array is not even aligned to the element size (a[0] at address 4), so every element straddles a boundary.
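
A minimal sketch of how such alignment can be obtained in practice (assuming a C11 compiler for aligned_alloc; the 32-byte boundary matches 256-bit AVX vectors, and the aligned clause used here is only introduced with the SIMD construct below):

  #include <stdlib.h>

  #define N 1024

  int main(void) {
      /* Allocate arrays on a 32-byte (256-bit) boundary so that vector
         loads and stores start at a vector boundary. */
      double *a = aligned_alloc(32, N * sizeof(double));
      double *b = aligned_alloc(32, N * sizeof(double));
      double *c = aligned_alloc(32, N * sizeof(double));

      for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

      /* The aligned clause tells the compiler about this alignment. */
      #pragma omp simd aligned(a, b, c : 32)
      for (int i = 0; i < N; i++)
          a[i] = b[i] + c[i];

      free(a); free(b); free(c);
      return 0;
  }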

Ways to Vectorize

From easy to explicit:
  - Compiler auto-vectorization
  - Explicit vector programming (e.g. with OpenMP)
  - Inline assembly
  - Assembler code (e.g. addps, mulpd, ...)

The OpenMP SIMD constructs

The SIMD construct

The SIMD construct enables the execution of multiple iterations of the associated loops concurrently by means of SIMD instructions.

C/C++:
  #pragma omp simd [clause(s)]
  for-loops

Fortran:
  !$omp simd [clause(s)]
  do-loops
  [!$omp end simd]

Clauses:
  - linear(list[:linear-step]): a variable increases linearly in every loop iteration
  - aligned(list[:alignment]): specifies that the data is aligned
  - private(list): as usual
  - lastprivate(list): as usual
  - reduction(reduction-identifier:list): as usual
  - collapse(n): collapse the loops first, then apply the SIMD instructions
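
A minimal sketch of the construct in use (the dot-product function and the 32-byte alignment assumption are illustrative, not from the slides):

  /* Sum of element-wise products, vectorized with the simd construct.
     reduction(+:sum) combines the per-lane partial sums; aligned(...)
     asserts that the caller passes 32-byte-aligned arrays. */
  double dot(const double *restrict x, const double *restrict y, int n)
  {
      double sum = 0.0;
      #pragma omp simd reduction(+:sum) aligned(x, y : 32)
      for (int i = 0; i < n; i++)
          sum += x[i] * y[i];
      return sum;
  }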

The SIMD construct

The safelen clause specifies a distance between loop iterations within which no dependencies occur.

  double a[6], b[6];
  for (i = 2; i < 6; i++) {
      a[i] = a[i-2] * b[i];
  }

Sequential execution:
  Step 1: a[0] * b[2] = a[2]
  Step 2: a[1] * b[3] = a[3]
  Step 3: a[2] * b[4] = a[4]
  Step 4: a[3] * b[5] = a[5]

Vector length 128 bit (two doubles):
  Step 1: a[0],a[1] * b[2],b[3] = a[2],a[3]
  Step 2: a[2],a[3] * b[4],b[5] = a[4],a[5]

The SIMD construct

The same loop with different vector lengths:

  double a[6], b[6];
  for (i = 2; i < 6; i++) {
      a[i] = a[i-2] * b[i];
  }

Vector length 128 bit (two doubles):
  Step 1: a[0],a[1] * b[2],b[3] = a[2],a[3]
  Step 2: a[2],a[3] * b[4],b[5] = a[4],a[5]

Vector length 256 bit (four doubles):
  Step 1: a[0],a[1],a[2],a[3] * b[2],b[3],b[4],b[5] = a[2],a[3],a[4],a[5]
  (this single step would read a[2] and a[3] before they have been updated)

Any vector length smaller than or equal to the length specified by safelen can be chosen for vectorization. In contrast to parallel for/do loops, the iterations are executed in a specified order.
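
A sketch of how the loop above could be annotated; safelen(2) reflects the dependence distance of 2 (the surrounding function is illustrative):

  void scale_chain(double a[6], const double b[6])
  {
      /* a[i] depends on a[i-2]: the dependence distance is 2, so at most
         two iterations may be combined into one vector step. safelen(2)
         states exactly that, permitting 128-bit but not 256-bit vectors
         for doubles. */
      #pragma omp simd safelen(2)
      for (int i = 2; i < 6; i++)
          a[i] = a[i-2] * b[i];
  }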

The loop SIMD construct

The loop SIMD construct specifies a loop that can be executed in parallel by all threads and in SIMD fashion on each thread.

C/C++:
  #pragma omp for simd [clause(s)]
  for-loops

Fortran:
  !$omp do simd [clause(s)]
  do-loops
  [!$omp end do simd [nowait]]

Loop iterations are first distributed across the threads; each chunk is then handled as a SIMD loop.

Clauses:
  - all clauses from the loop or SIMD construct are allowed
  - clauses that are allowed on both constructs are applied twice, once for the threads and once for the SIMDization
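
A minimal sketch using the combined parallel for simd form (function and array names are illustrative):

  /* Iterations are first divided among the threads of the team;
     each thread then vectorizes its own chunk. */
  void vadd(float *restrict a, const float *restrict b,
            const float *restrict c, int n)
  {
      #pragma omp parallel for simd schedule(static)
      for (int i = 0; i < n; i++)
          a[i] = b[i] + c[i];
  }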

The declare SIMD construct

Function calls in SIMD loops can lead to bottlenecks, because the called functions need to be executed serially.

  for (i = 0; i < N; i++) {
      a[i] = b[i] + c[i];
      d[i] = sin(a[i]);
      e[i] = 5 * d[i];
  }

Solutions:
  - avoid or inline the functions
  - create functions which work on vectors instead of scalars

The declare SIMD construct

Enables the creation of multiple versions of a function or subroutine, where one or more versions can process multiple arguments using SIMD instructions.

C/C++:
  #pragma omp declare simd [clause(s)]
  [#pragma omp declare simd [clause(s)]]
  function definition / declaration

Fortran:
  !$omp declare simd (proc_name) [clause(s)]

Clauses:
  - simdlen(length): the number of arguments to process simultaneously
  - linear(list[:linear-step]): a variable increases linearly in every loop iteration
  - aligned(argument-list[:alignment]): specifies that the data is aligned
  - uniform(argument-list): the arguments have an invariant value
  - inbranch / notinbranch: the function is always/never called from within a conditional statement
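
A sketch of a function vectorized with these clauses (the function itself is illustrative, not from the slides):

  /* One SIMD version processes several values of i at once; the pointer p
     is the same for all lanes (uniform), i advances by 1 per lane (linear),
     and the function is never called from inside a branch (notinbranch). */
  #pragma omp declare simd uniform(p) linear(i:1) notinbranch
  double scaled_load(const double *p, int i)
  {
      return 2.0 * p[i];
  }

  /* Call site: the compiler may call the SIMD version of scaled_load. */
  void scale_all(const double *p, double *out, int n)
  {
      #pragma omp simd
      for (int i = 0; i < n; i++)
          out[i] = scaled_load(p, i);
  }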

PI Example Code

File f.c:
  #pragma omp declare simd
  double f(double x)
  {
      return (4.0 / (1.0 + x*x));
  }

File pi.c:
  #pragma omp declare simd
  double f(double x);

  #pragma omp simd linear(i) private(fx) reduction(+:fsum)
  for (i = 0; i < n; i++) {
      fx = fh * ((double)i + 0.5);
      fsum += f(fx);
  }
  return fh * fsum;

Calculating pi by numerical integration of: pi = integral from 0 to 1 of 4 / (1 + x^2) dx

Example 1: Pi

Runtime of the benchmark on:
  - Westmere CPU with SSE (128-bit vectors)
  - Intel Xeon Phi with AVX-512 (512-bit vectors)

                   Runtime Westmere   Speedup Westmere   Runtime Xeon Phi   Speedup Xeon Phi
  non-vectorized   1.44 sec           1                  16.25 sec          1
  vectorized       0.72 sec           2                  1.82 sec           8.9

Note: the speedup for memory-bound applications may be lower on both systems.

OpenMP for Accelerators

Intel Xeon Phi

Intel Xeon Phi Coprocessor (source: Intel):
  - 1 x Intel Xeon Phi @ 1090 MHz
  - 60 cores (in-order), ~1 TFLOPS double-precision peak
  - 4 hardware threads per core (SMT)
  - 8 GB GDDR5 memory
  - 512-bit SIMD vectors (32 registers)
  - fully coherent L1 and L2 caches
  - plugged into the PCI Express bus

GPU architecture: Fermi (NVIDIA Corporation, 2010)

  - 3 billion transistors
  - 14-16 streaming multiprocessors (SMs), each comprising 32 cores
  - 448-512 cores / streaming processors (SPs), each with floating-point and integer units
  - memory hierarchy
  - peak performance: 1.03 TFlops single precision, 515 GFlops double precision
  - ECC support
  - compute capability 2.0: defines the supported features, e.g. double-precision capability and memory access patterns

Execution + Data Model

  - The data environment is lexically scoped: it is destroyed at the closing curly brace, and allocated buffers/data are automatically released.
  - Use the target construct to transfer control from the host to the device and to establish a data environment (if not yet done).
  - The host thread waits until the offloaded region has completed.

  #pragma omp target     \
          map(alloc:...) \
          map(to:...)    \
          map(from:...)
  {
      ...
  }

Host/device picture: (1) alloc(...) creates space on the device, (2) to(...) copies data from the host to the device, (3) the region executes on the device, (4) from(...) copies results back to the host.

Example: SAXPY

SAXPY: Serial (Host)

  int main(int argc, const char* argv[]) {
      int n = 10240;
      float a = 2.0f;
      float b = 3.0f;
      float *x = (float*) malloc(n * sizeof(float));
      float *y = (float*) malloc(n * sizeof(float));
      // Initialize x, y

      // Run SAXPY TWICE
      for (int i = 0; i < n; ++i) {
          y[i] = a*x[i] + y[i];
      }

      // y is needed and modified on the host here

      for (int i = 0; i < n; ++i) {
          y[i] = b*x[i] + y[i];
      }

      free(x); free(y);
      return 0;
  }

SAXPY: OpenMP 4.0 (Intel MIC)

  int main(int argc, const char* argv[]) {
      int n = 10240;
      float a = 2.0f;
      float b = 3.0f;
      float *x = (float*) malloc(n * sizeof(float));
      float *y = (float*) malloc(n * sizeof(float));
      // Initialize x, y

      // Run SAXPY TWICE
      #pragma omp target map(tofrom:y[0:n]) map(to:x[0:n])
      #pragma omp parallel for
      for (int i = 0; i < n; ++i) {
          y[i] = a*x[i] + y[i];
      }

      // y is needed and modified on the host here

      #pragma omp target map(tofrom:y[0:n]) map(to:x[0:n])
      #pragma omp parallel for
      for (int i = 0; i < n; ++i) {
          y[i] = b*x[i] + y[i];
      }

      free(x); free(y);
      return 0;
  }

SAXPY: OpenMP 4.0 (Intel MIC), with a target data region

  int main(int argc, const char* argv[]) {
      int n = 10240;
      float a = 2.0f;
      float b = 3.0f;
      float *x = (float*) malloc(n * sizeof(float));
      float *y = (float*) malloc(n * sizeof(float));
      // Initialize x, y

      // Run SAXPY TWICE
      #pragma omp target data map(to:x[0:n])
      {
          #pragma omp target map(tofrom:y[0:n])
          #pragma omp parallel for
          for (int i = 0; i < n; ++i) {
              y[i] = a*x[i] + y[i];
          }

          // y is needed and modified on the host here

          #pragma omp target map(tofrom:y[0:n])
          #pragma omp parallel for
          for (int i = 0; i < n; ++i) {
              y[i] = b*x[i] + y[i];
          }
      }

      free(x); free(y);
      return 0;
  }

SAXPY: OpenMP 4.0 (NVIDIA GPGPU)

  int main(int argc, const char* argv[]) {
      int n = 10240;
      float a = 2.0f;
      float b = 3.0f;
      float *x = (float*) malloc(n * sizeof(float));
      float *y = (float*) malloc(n * sizeof(float));
      // Initialize x, y

      // Run SAXPY TWICE
      #pragma omp target data map(to:x[0:n])
      {
          #pragma omp target map(tofrom:y[0:n])
          #pragma omp teams
          #pragma omp distribute
          #pragma omp parallel for
          for (int i = 0; i < n; ++i) {
              y[i] = a*x[i] + y[i];
          }

          // y is needed and modified on the host here

          #pragma omp target map(tofrom:y[0:n])
          #pragma omp teams
          #pragma omp distribute
          #pragma omp parallel for
          for (int i = 0; i < n; ++i) {
              y[i] = b*x[i] + y[i];
          }
      }

      free(x); free(y);
      return 0;
  }

Target Constructs

Target Data Construct

Creates a device data environment for the extent of the region:
  - when a target data construct is encountered, a new device data environment is created, and the encountering task executes the target data region
  - when an if clause is present and the if-expression evaluates to false, the device is the host

C/C++:
  #pragma omp target data [clause(s)]
  structured-block
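
A sketch of the if clause behavior described above (the function and the threshold of 100000 are illustrative):

  void saxpy_maybe_offloaded(float *y, const float *x, float a, int n)
  {
      /* If n is small, offloading is not worthwhile: with the if-expression
         false, the device data environment is created on the host and the
         enclosed target region also executes on the host. */
      #pragma omp target data map(to: x[0:n]) map(tofrom: y[0:n]) if(n > 100000)
      {
          #pragma omp target if(n > 100000)
          #pragma omp parallel for
          for (int i = 0; i < n; ++i)
              y[i] = a * x[i] + y[i];
      }
  }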

Map Clause

Maps a variable from the current task's data environment to the device data environment associated with the construct:
  - the list items that appear in a map clause may include array sections
  - alloc map-type: each new corresponding list item has an undefined initial value
  - to map-type: each new corresponding list item is initialized with the original list item's value
  - from map-type: declares that on exit from the region the corresponding list item's value is assigned to the original list item
  - tofrom map-type (the default): combination of to and from

C/C++:
  map([map-type:] list)
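
A sketch showing the three explicit map-types on array sections (the function is illustrative):

  #include <stdlib.h>

  void scale_and_shift(float *out, const float *in, int n)
  {
      float *tmp = (float*) malloc(n * sizeof(float));

      /* in[0:n]  is copied to the device on entry (to),
         out[0:n] is copied back to the host on exit (from),
         tmp[0:n] is only allocated on the device (alloc); its host
         contents are never transferred. */
      #pragma omp target map(to: in[0:n]) map(from: out[0:n]) map(alloc: tmp[0:n])
      {
          #pragma omp parallel for
          for (int i = 0; i < n; ++i) tmp[i] = 2.0f * in[i];
          #pragma omp parallel for
          for (int i = 0; i < n; ++i) out[i] = tmp[i] + 1.0f;
      }

      free(tmp);
  }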

Target Construct

Creates a device data environment and executes the construct on the same device:
  - a superset of the target data construct: in addition, the target construct specifies that the region is executed by a device, and the encountering task waits for the device to complete the target region

C/C++:
  #pragma omp target [clause(s)]
  structured-block

Example: Target Construct

  #pragma omp target device(0)
  #pragma omp parallel for
  for (i = 0; i < n; i++)
      ...

  #pragma omp target
  #pragma omp teams num_teams(8) num_threads(4)
  #pragma omp distribute
  for ( k = 0; k < NUM_K; k++ ) {
      #pragma omp parallel for
      for ( j = 0; j < NUM_J; j++ ) {
          ...
      }
  }

Target Update Construct

Makes the corresponding list items in the device data environment consistent with their original list items, according to the specified motion clauses (to / from).

C/C++:
  #pragma omp target update [clause(s)]
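
A sketch of target update used between two offloaded loops inside a target data region (the host-side modification of y is illustrative):

  void saxpy_twice(float *y, const float *x, float a, float b, int n)
  {
      #pragma omp target data map(to: x[0:n]) map(tofrom: y[0:n])
      {
          #pragma omp target
          #pragma omp parallel for
          for (int i = 0; i < n; ++i)
              y[i] = a * x[i] + y[i];

          /* Bring y back to the host, modify it there ... */
          #pragma omp target update from(y[0:n])
          for (int i = 0; i < n; ++i)
              y[i] += 1.0f;

          /* ... and make the device copy consistent again before the
             second offloaded loop. */
          #pragma omp target update to(y[0:n])

          #pragma omp target
          #pragma omp parallel for
          for (int i = 0; i < n; ++i)
              y[i] = b * x[i] + y[i];
      }
  }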

Declare Target Directive

Specifies that [static] variables, functions (C, C++ and Fortran) and subroutines (Fortran) are mapped to a device:
  - if a list item is a function or subroutine, then a device-specific version of the routine is created that can be called from a target region
  - if a list item is a variable, then the original variable is mapped to a corresponding variable in the initial device data environment for all devices (if the variable is initialized, it is mapped with the same value)
  - all declarations and definitions of a function must have a declare target directive

C/C++:
  #pragma omp declare target
  declarations / definitions
  #pragma omp end declare target
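
A sketch of declare target applied to a global variable and a function (names are illustrative):

  /* Both get device versions; the variable is mapped to all devices
     with its initial value. */
  #pragma omp declare target
  int offset = 10;
  int add_offset(int v) { return v + offset; }
  #pragma omp end declare target

  void run(int *data, int n)
  {
      /* add_offset can be called inside the target region because a
         device-specific version of it exists. */
      #pragma omp target map(tofrom: data[0:n])
      #pragma omp parallel for
      for (int i = 0; i < n; ++i)
          data[i] = add_offset(data[i]);
  }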

Teams Construct (1/2)

Creates a league of thread teams where the master thread of each team executes the region:
  - the number of teams is determined by the num_teams clause, the number of threads in each team by the num_threads clause
  - within a teams region, team numbers uniquely identify each team (omp_get_team_num())
  - once created, the number of teams remains constant for the duration of the teams region

The teams region is executed by the master thread of each team. The threads other than the master thread do not begin execution until the master thread encounters a parallel region.

Only the following constructs can be closely nested in the teams region: distribute, parallel, parallel loop/for, parallel sections and parallel workshare.

Teams Construct (2/2)

A teams construct must be contained within a target construct, which must not contain any statements or directives outside of the teams construct.

After the teams have completed execution of the teams region, the encountering thread resumes execution of the enclosing target region.

C/C++:
  #pragma omp teams [clause(s)]
  structured-block
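
A sketch of a teams region (the team count and the use of omp_get_team_num() are illustrative):

  #include <omp.h>

  void count_teams(int *counts)   /* counts has at least 8 elements */
  {
      #pragma omp target map(from: counts[0:8])
      #pragma omp teams num_teams(8)
      {
          /* Only the master thread of each team executes the teams region;
             the other threads would start at a nested parallel region. */
          counts[omp_get_team_num()] = 1;
      }
  }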

Distribute Construct

Specifies that the iterations of one or more loops will be executed by the thread teams; the iterations are distributed across the master threads of all teams:
  - there is no implicit barrier at the end of a distribute construct
  - a distribute construct must be closely nested in a teams region

C/C++:
  #pragma omp distribute [clause(s)]
  for-loops
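
A sketch of distribute with an explicit dist_schedule (the team count and chunk size are illustrative):

  void scale_on_device(float *y, const float *x, int n)
  {
      #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
      #pragma omp teams num_teams(4)
      /* dist_schedule(static, 512): chunks of 512 iterations are handed out
         round-robin to the master threads of the 4 teams; a team that
         finishes does not wait at a barrier. */
      #pragma omp distribute dist_schedule(static, 512)
      for (int i = 0; i < n; ++i)
          y[i] = 2.0f * x[i] + y[i];
  }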

Questions?