OpenMP - III. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS15/16. HPAC, RWTH Aachen


OpenMP - III Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS15/16

OpenMP References Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press, 2007. B. Chapman, G. Jost, R. van der Pas. Diego Fabregat OpenMP - III 2 / 1

Worksharing constructs

Distribute the execution of code among the team members:
- Loop construct
- Sections construct
- Single construct

Loop construct

Syntax:
    #pragma omp for [clause [, clause]...]
    for-loop

Example:
    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < n; i++)
            [...]
    }

Canonical form:
    for (init-expr; var relop b; incr-expr)


Loop construct

Sequential:
    int i;
    for (i = 0; i < N; i++)
        z[i] = alpha * x[i] + y[i];

Parallel region:
    #pragma omp parallel
    {
        int i, id, nths, istart, iend;
        id = omp_get_thread_num();
        nths = omp_get_num_threads();
        istart = id * N / nths;
        iend = (id+1) * N / nths;
        if (id == nths-1) iend = N;
        for (i = istart; i < iend; i++)
            z[i] = alpha * x[i] + y[i];
    }

Loop construct:
    #pragma omp parallel
    {
        int i;
        #pragma omp for
        for (i = 0; i < N; i++)
            z[i] = alpha * x[i] + y[i];
    }

Shortcut:
    int i;
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        z[i] = alpha * x[i] + y[i];

Loop construct

Guidelines:
- Identify the compute-intensive loops
- Can the iterations be run independently?
- If needed/possible, remove dependencies
- Add the for directive
- Consider alternative schedulings if load balancing is bad

Loop construct - Dependencies

So far, trivially parallelizable loops:
    for (i = 0; i < n; i++) {
        z[i] = alpha * x[i] + y[i];
        w[i] = z[i] * z[i];
    }

What if we run the following code in parallel?
    fib[0] = fib[1] = 1;
    for (i = 2; i < n; i++)
        fib[i] = fib[i-1] + fib[i-2];

When run on my system with n = 10 and 2 threads:
    Sometimes: [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
    Others:    [1, 1, 2, 3, 5, 0, 0, 0, 0, 0]

Loop construct - Dependencies

Loop-carried dependencies: a memory location is written in one iteration, and it is also read or written in another iteration.
    for (i = 1; i < n; i++)
        a[i] = a[i] + a[i-1];

Race condition: the result depends on the order in which operations are performed.

Classification of dependencies:
- Flow/True dependencies
- Anti dependencies
- Output dependencies

Loop construct - Dependencies

Flow dependence:
- Iteration i writes to a location
- Iteration i + 1 reads from the location

We can remove some loop-carried dependencies. For instance:

Reductions:
    x = x + a[i];

Induction variables:
    j = 5;
    for (i = 1; i < n; i++) {
        j += 2;
        a[i] = f(j);
    }

Rewrite j += 2; as j = 5 + 2*i;
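The two rewrites above can be sketched in C as follows. This is a minimal illustration, not the course's own code: the function names fill_from_induction and sum_array are hypothetical, and f is a made-up stand-in for the slide's f(j). The reduction clause is the standard OpenMP way to handle the x = x + a[i] pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the slide's f(j); any pure function would do. */
static double f(int j) { return 0.5 * j; }

/* Induction variable removed: j is computed directly from i, so no
   iteration needs the value of j left behind by the previous one. */
void fill_from_induction(double *a, int n)
{
    assert(a != NULL);
    int i;
    #pragma omp parallel for
    for (i = 1; i < n; i++) {
        int j = 5 + 2 * i;   /* closed form of: j = 5; then j += 2 per iteration */
        a[i] = f(j);
    }
}

/* Reduction: each thread accumulates a private partial sum; the partial
   sums are combined into x when the loop ends. */
double sum_array(const double *a, int n)
{
    assert(a != NULL);
    double x = 0.0;
    int i;
    #pragma omp parallel for reduction(+:x)
    for (i = 0; i < n; i++)
        x = x + a[i];
    return x;
}
```

Both functions give the same result whether or not the compiler honours the pragmas, which makes it easy to compare the parallel build against a sequential one.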

Loop construct - Dependencies

Anti dependence:
- Iteration i reads from a location
- Iteration i + 1 writes to the location

Example:
    for (i = 0; i < n-1; i++)
        a[i] = a[i+1] + f(i);

Split the loop, and make a copy of array a:
    #pragma omp parallel for
    for (i = 0; i < n-1; i++)
        a2[i] = a[i+1];

    #pragma omp parallel for
    for (i = 0; i < n-1; i++)
        a[i] = a2[i] + f(i);

Loop construct - Dependencies

Output dependence:
- Iteration i writes to a location
- Iteration i + 1 writes to the same location

Example (every iteration writes tmp):
    for (i = 0; i < n; i++) {
        tmp = A[i];
        A[i] = B[i];
        B[i] = tmp;
    }

Expanding the scalar tmp into an array would be unnecessarily expensive. Make tmp private instead:
    #pragma omp parallel for private(tmp)
    for (i = 0; i < n; i++) {
        tmp = A[i];
        A[i] = B[i];
        B[i] = tmp;
    }

Loop construct - Dependencies

Other situations

Even/odd parallelization:
    for (i = 2; i < n; i++)
        a[i] = a[i-2] + x;

can be rewritten as
    for (i = 2; i < n; i+=2)
        a[i] = a[i-2] + x;
    for (i = 3; i < n; i+=2)
        a[i] = a[i-2] + x;

The even and odd chains are independent of each other, so the two loops can be executed concurrently (each loop is still sequential internally).

Not a dependence (reads touch a[n..2n-1], writes touch a[0..n-1]):
    for (i = 0; i < n; i++)
        a[i] = a[i+n] + x;
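One way to run the even and odd loops side by side is the sections construct, listed earlier among the worksharing constructs. The sketch below is an assumption about how this could be written, not code from the slides; the function name add_stride_two is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Computes a[i] = a[i-2] + x for i = 2..n-1.
   The even-index chain and the odd-index chain never touch the same
   element, so each chain is given to its own section. */
void add_stride_two(double *a, int n, double x)
{
    assert(a != NULL);
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            /* even chain: a[2], a[4], ... (sequential inside) */
            for (int i = 2; i < n; i += 2)
                a[i] = a[i-2] + x;
        }
        #pragma omp section
        {
            /* odd chain: a[3], a[5], ... (sequential inside) */
            for (int i = 3; i < n; i += 2)
                a[i] = a[i-2] + x;
        }
    }
}
```

With only two sections, at most two threads do useful work, so this transformation helps only when no better restructuring is available.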

Exercise - Loop-carried dependencies

Try to parallelize the different pieces of code below.

    for (i = 0; i < n-1; i++)
        a[i] = a[i+1] + b[i] * c[i];

    for (i = 1; i < n; i++)
        a[i] = a[i-1] + b[i] * c[i];

    t = 1;
    for (i = 0; i < n-1; i++) {
        a[i] = a[i+1] + b[i] * c[i];
        t = t * a[i];
    }

Scheduling

Loop construct - Scheduling

Scheduling: how loop iterations are distributed among threads.

To achieve best performance, threads must be busy most of the time, minimizing the time they remain idle, wasting resources.

Keys:
- Good load balance (work evenly distributed)
- Minimum scheduling overhead
- Minimum synchronization

Loop construct - Scheduling

OpenMP allows us to choose among several scheduling schemes via the schedule clause:

    schedule(kind[, chunk_size])

where kind is one of:
- static
- dynamic
- guided
- auto
- runtime

and the iteration space is split into chunks of chunk_size consecutive iterations.

Loop construct - schedule: static

- Divide the iterations into NUM_THREADS (roughly) equal chunks and give one to each thread (in order)
- If chunk_size is specified, divide into chunks of size chunk_size, and distribute them cyclically in a round-robin fashion

Example:
    #pragma omp for schedule(static, 4)
    for (i = 0; i < 20; i++)
        [...]

Assuming execution with 2 threads:
    TH0: [0-3]
    TH1: [4-7]
    TH0: [8-11]
    TH1: [12-15]
    TH0: [16-19]
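The round-robin assignment for schedule(static, chunk_size) can be written down as a one-line formula. The helper below is a hypothetical model for illustration (it is not an OpenMP routine) and assumes a loop that starts at 0 with step 1, as in the example above.

```c
#include <assert.h>

/* Model of schedule(static, chunk): returns the number of the thread
   that executes iteration i. Chunk k (iterations k*chunk .. k*chunk+chunk-1)
   goes to thread k mod nthreads. */
int static_owner(int i, int chunk, int nthreads)
{
    assert(i >= 0 && chunk > 0 && nthreads > 0);
    return (i / chunk) % nthreads;
}
```

For the example with chunk size 4 and 2 threads, static_owner(5, 4, 2) is 1 (iteration 5 lies in [4-7]) and static_owner(16, 4, 2) is 0, matching the listing.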

Loop construct - schedule: dynamic

- Conceptually, this scheme creates a queue of chunks, from which the threads keep grabbing chunks to execute, until no more chunks are left
- By default, chunk_size equals 1

Example:
    #pragma omp for schedule(dynamic, 2)
    for (i = 0; i < 10; i++)
        [...]

Possible run assuming execution with 2 threads:
    TH0: [0-1]
    TH1: [2-3]
    TH1: [4-5]
    TH0: [6-7]
    TH1: [8-9]
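Dynamic scheduling pays off when the cost of an iteration varies. The sketch below is an illustrative workload chosen for this note, not taken from the course material: trial-division primality testing gets more expensive as k grows, so a static split would leave the thread with the largest k values working longest. The names is_prime and count_primes are hypothetical.

```c
#include <assert.h>

/* Trial division: cost grows roughly with sqrt(k). */
static int is_prime(int k)
{
    if (k < 2) return 0;
    for (int d = 2; d * d <= k; d++)
        if (k % d == 0) return 0;
    return 1;
}

/* Counts the primes in [0, limit). Chunks of 2 iterations are handed
   out on demand, so threads that finish cheap chunks fetch more work. */
int count_primes(int limit)
{
    assert(limit >= 0);
    int count = 0;
    int k;
    #pragma omp parallel for schedule(dynamic, 2) reduction(+:count)
    for (k = 0; k < limit; k++)
        count += is_prime(k);
    return count;
}
```

The chunk size of 2 mirrors the slide's example; for a real workload a larger chunk usually reduces the queueing overhead.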

Loop construct - schedule: guided

- Similar to dynamic
- Difference: start with large chunk sizes, which decrease exponentially
- Chunks consist of at least chunk_size iterations (except maybe the last one)
- chunk_size defaults to 1

Example:
    #pragma omp for schedule(guided, 2)
    for (i = 0; i < 40; i++)
        [...]

Possible run assuming execution with 3 threads:
    TH2: [0-13]
    TH0: [14-22]
    TH1: [23-28] [29-32] ...
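The chunk sizes in the example (14, 9, 6, 4, ...) are consistent with taking ceil(remaining / threads) for each new chunk. The helper below models that rule; it is an assumption made to reproduce the numbers on the slide, since the exact formula is left to the implementation, and guided_chunks is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Model of schedule(guided, min_chunk): writes successive chunk sizes
   into sizes[] (at most max_chunks entries) and returns how many
   chunks were produced. Assumed rule: ceil(remaining / nthreads),
   but never less than min_chunk, except for the final leftover. */
int guided_chunks(int n, int nthreads, int min_chunk,
                  int *sizes, int max_chunks)
{
    assert(n >= 0 && nthreads > 0 && min_chunk > 0 && sizes != NULL);
    int remaining = n;
    int count = 0;
    while (remaining > 0 && count < max_chunks) {
        int c = (remaining + nthreads - 1) / nthreads;  /* ceiling division */
        if (c < min_chunk) c = min_chunk;
        if (c > remaining) c = remaining;               /* last chunk may be short */
        sizes[count++] = c;
        remaining -= c;
    }
    return count;
}
```

For 40 iterations, 3 threads and a minimum chunk of 2, this model yields the sizes 14, 9, 6, 4, 3, 2, 2, whose first four entries match the ranges [0-13], [14-22], [23-28], [29-32] above.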

Loop construct - schedule: auto and runtime

auto:
- Decision on scheduling is delegated to the compiler/runtime
- The programmer gives freedom to the implementation
- May simply resort to static or dynamic

runtime:
- Decision on scheduling is deferred until run time
- Schedule and chunk size are taken from internal variables
- May be specified via a routine call or via environment variables:
    omp_set_schedule( kind, chunk_size )
    export OMP_SCHEDULE="kind,chunk_size"
- Mainly used for testing (so that we do not need to edit and recompile every time)
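A short sketch of schedule(runtime) follows. The loop itself names no schedule; the choice comes either from OMP_SCHEDULE or from a call to omp_set_schedule. The function names use_dynamic_schedule and sum_of_squares are hypothetical, and the _OPENMP guard is there only so the file also builds when OpenMP is switched off.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Selects dynamic scheduling with chunks of 4 for later
   schedule(runtime) loops; roughly what
   export OMP_SCHEDULE="dynamic,4" would request. */
void use_dynamic_schedule(void)
{
#ifdef _OPENMP
    omp_set_schedule(omp_sched_dynamic, 4);
#endif
}

/* Sum of i*i for i in [0, n); the schedule is decided at run time. */
long sum_of_squares(int n)
{
    assert(n >= 0);
    long total = 0;
    int i;
    #pragma omp parallel for schedule(runtime) reduction(+:total)
    for (i = 0; i < n; i++)
        total += (long)i * i;
    return total;
}
```

Leaving out the call to use_dynamic_schedule lets the environment variable decide, which is the variant suited to trying several schedules without recompiling.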

Loop construct - Scheduling

Most used: static and dynamic

static:
- Workload is predetermined and predictable by the programmer
- Cyclic distribution allows good load balancing
- The iteration-to-thread mapping is fixed before the loop runs (it can be computed at compile time), adding little parallel overhead

dynamic:
- Unpredictable/variable work per iteration
- Decisions on scheduling are made at run time, adding overhead

See: 05.loop-schedule.c

A few more clauses

Loop construct - Clauses: collapse

Syntax: collapse(n)

- Indicates how many loops are associated with the loop construct
- n must be a constant positive integer expression
- The iterations of the n loops are collapsed into one larger iteration space
- The order of the collapsed space is that of the equivalent sequential execution of the iterations

Loop construct - Clauses: collapse

Example:
    #pragma omp for collapse(3)
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < l; k++)
                [...]

The resulting iteration space is of size m * n * l.
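As a compact, compilable variant of the example, the sketch below collapses two perfectly nested loops. It is an illustration with a hypothetical function name (fill_index); the matrix is assumed to be stored row by row in a flat array.

```c
#include <assert.h>
#include <stddef.h>

/* Fills an m-by-n row-major matrix with c[i][j] = i * n + j.
   collapse(2) merges both loops into one space of m * n iterations,
   which helps when m alone is smaller than the number of threads.
   Both i and j are made private by the construct. */
void fill_index(int *c, int m, int n)
{
    assert(c != NULL && m >= 0 && n >= 0);
    int i, j;
    #pragma omp parallel for collapse(2)
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            c[i * n + j] = i * n + j;
}
```

The loops must be perfectly nested, with nothing between the two for statements, for collapse to apply.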

Loop construct - Clauses: ordered

- Used in combination with the ordered construct
- The ordered construct specifies a structured block (portion of code) in a loop region to be executed in sequential order
- Sequentializes and orders the execution within an ordered region while allowing code outside to be run in parallel

Loop construct - Clauses: ordered

Example:
    #pragma omp parallel
    {
        #pragma omp for ordered
        for (i = 0; i < n; i++) {
            int id = omp_get_thread_num();
            printf("thread %d updates a[%d]\n", id, i);
            a[i] += i;
            /* executed in the order of the sequential loop */
            #pragma omp ordered
            printf("thread %d prints value of a[%d] = %d\n", id, i, a[i]);
        }
    }

Synchronization

Synchronization

- OpenMP: shared-memory programming model
- Unintended sharing of data causes race conditions
- Protect against data conflicts with synchronization

Main constructs/tools:
- Barriers: barrier
- Critical sections: critical, atomic
- Locks (low level)

Synchronization - Barriers

Syntax:
    #pragma omp barrier

- Synchronizes all threads in the enclosing parallel region
- Ensures that all the code before the barrier has been executed by all threads before proceeding with the code beyond the barrier

Synchronization - Barriers

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        A[id] = lot_of_work_inside( id );
        #pragma omp barrier
        B[id] = more_work( A, id );
    }

Synchronization - Critical sections

Syntax:
    #pragma omp critical

When?
- Every thread must execute a section of the code
- They can execute it in any order
- Mutual exclusion is required

Synchronization - Critical sections

Don't do this at home!

    double sum = 0.0, pi;
    double step = 1.0/NUM_STEPS;
    double x_i;
    int i;

    #pragma omp parallel for private(x_i)
    for ( i = 0; i < NUM_STEPS; i++ ) {
        x_i = (i + 0.5) * step;
        #pragma omp critical
        sum = sum + 4.0 / (1.0 + x_i * x_i);
    }
    pi = sum * step;

Synchronization - Critical sections

To keep in mind...
- Minimize the size of critical sections
- Refactor and pull heavy work outside the region
- Have local copies, then reduce in a critical region
- Make sure you don't hit the critical section too often (like every ms)
- Otherwise the overhead will kill your performance
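The advice "have local copies, then reduce in a critical region" can be applied to the pi integration from the previous slide. The sketch below is one possible rewrite, not the course's reference solution; the function name pi_local_then_critical and the num_steps parameter are hypothetical.

```c
#include <assert.h>

/* Midpoint-rule approximation of pi = integral of 4/(1+x^2) on [0,1].
   Each thread sums into its own local variable and enters the
   critical section once, instead of once per iteration. */
double pi_local_then_critical(int num_steps)
{
    assert(num_steps > 0);
    double sum = 0.0;
    double step = 1.0 / num_steps;

    #pragma omp parallel
    {
        double local = 0.0;   /* declared inside the region, hence private */
        int i;
        #pragma omp for
        for (i = 0; i < num_steps; i++) {
            double x_i = (i + 0.5) * step;
            local += 4.0 / (1.0 + x_i * x_i);
        }
        #pragma omp critical
        sum += local;         /* one update per thread */
    }
    return sum * step;
}
```

The number of critical-section entries drops from num_steps to the number of threads, which removes most of the serialization.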

Synchronization - Atomic

Syntax:
    #pragma omp atomic

- The atomic construct is a very restricted form of critical
- Often, hardware provides support to perform quick updates of memory locations
- If those hardware instructions are available, atomic tells the compiler to use them
- Otherwise, it acts as a critical region

Synchronization - Atomic

Restricted critical section, applied to a single statement.

Valid statements:
    x++, x--, ++x, --x
    x binop= expr
    x = x binop expr
    x = expr binop x

where
- x is of scalar type
- expr is an expression of scalar type (which does not include x)
- binop is one of +, -, *, /, &, ^, |, <<, >>
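A typical use of atomic is a shared counter array that several threads increment. The histogram below is an illustrative sketch with hypothetical names (histogram, bins); it assumes non-negative input values and uses the x++ form from the list above.

```c
#include <assert.h>
#include <stddef.h>

/* Counts how many values fall into each of nbins bins (value mod nbins).
   Different iterations may hit the same bin, so the increment must be
   protected; atomic covers exactly this single update. */
void histogram(const int *data, int n, int *bins, int nbins)
{
    assert(data != NULL && bins != NULL && nbins > 0);
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        int b = data[i] % nbins;   /* assumes data[i] >= 0 */
        #pragma omp atomic
        bins[b]++;
    }
}
```

Only the increment is atomic; computing b happens outside the protected statement and runs fully in parallel.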

Synchronization

Timings (seconds):

    # threads   Sequential   critical   atomic   reduction
    1           0.027        0.031      0.026    0.030
    2           0.027        0.295      0.108    0.015
    4           0.027        0.572      0.117    0.008
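The slide does not say which program was timed; assuming it is the pi integration shown in the critical-section example, the reduction column would correspond to a version along the following lines. This is a sketch under that assumption, with a hypothetical function name (pi_reduction).

```c
#include <assert.h>

/* Midpoint-rule approximation of pi using the reduction clause:
   every thread gets a private copy of sum, and the copies are added
   together at the end of the loop without any explicit locking. */
double pi_reduction(int num_steps)
{
    assert(num_steps > 0);
    double sum = 0.0;
    double step = 1.0 / num_steps;
    int i;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        double x_i = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x_i * x_i);
    }
    return sum * step;
}
```

Of the three parallel variants in the table, reduction is the only one whose time goes down as threads are added.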