Session 4: Parallel Programming with OpenMP

Xavier Martorell, Barcelona Supercomputing Center (BSC). PATC Parallel Programming Workshop, November 26-30, 2012.

Agenda
10:00-11:00  OpenMP fundamentals, parallel regions
11:00-11:30  Worksharing constructs
11:30-12:00  Break
12:00-12:15  Synchronization mechanisms in OpenMP
12:15-13:00  Practical: heat diffusion
13:00-14:00  Lunch
14:00-14:20  Programming using a hybrid MPI/OpenMP approach
14:20-17:00  Practical: heat diffusion

Part I: OpenMP fundamentals, parallel regions

Outline
- OpenMP Overview
- The OpenMP model
- Writing OpenMP programs
- Creating Threads
- Data-sharing attributes

OpenMP Overview

What is OpenMP?
It is an API extension to the C, C++ and Fortran languages for writing parallel programs for shared-memory machines.
- Current version is 3.1 (July 2011); 4.0 is open for public comments
- Supported by most compiler vendors: Intel, IBM, PGI, Sun, Cray, Fujitsu, HP, GCC...
- Maintained by the Architecture Review Board (ARB), a consortium of industry and academia
- http://www.openmp.org

A bit of history
- OpenMP Fortran 1.0 (1997)
- OpenMP C/C++ 1.0 (1998)
- OpenMP Fortran 1.1 (1999)
- OpenMP Fortran 2.0 (2000)
- OpenMP C/C++ 2.0 (2002)
- OpenMP 2.5 (2005)
- OpenMP 3.0 (2008)
- OpenMP 3.1 (2011)

Advantages of OpenMP
- Mature standard and implementations; standardizes the practice of the last 20 years
- Good performance and scalability
- Portable across architectures
- Incremental parallelization; maintains the sequential version (mostly)
- High-level language (some people may say a medium-level language :-))
- Supports both task and data parallelism
- Communication is implicit

Disadvantages of OpenMP
- Communication is implicit
- Flat memory model
- Incremental parallelization creates a false sense of glory/failure
- No support for accelerators (...yet, maybe in 4.0)
- No error recovery capabilities (...yet, 4.0)
- Difficult to compose
- Lacks high-level algorithms and structures
- Does not run on clusters

The OpenMP model

OpenMP at a glance
[Diagram: OpenMP components] User code with OpenMP constructs is translated by the compiler into an OpenMP executable; the OpenMP API and environment variables set the internal control variables (ICVs) of the OpenMP runtime library, which runs on top of the OS threading libraries on the CPUs of an SMP machine.

Execution model
OpenMP uses a fork-join model:
- The master thread spawns a team of threads that joins at the end of the parallel region
- Threads in the same team can collaborate to do work
- Parallel regions can be nested (nested parallel regions)
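
As a minimal illustration of the fork-join model (not part of the original slides), the sketch below forks a team of threads that each print their id and then join back into the master thread; it can be compiled, for example, with gcc -fopenmp:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("before: only the master thread runs here\n");

        #pragma omp parallel            /* fork: a team of threads is created */
        {
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                               /* join: implicit barrier, team ends  */

        printf("after: only the master thread runs here\n");
        return 0;
    }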

Memory model
OpenMP defines a relaxed memory model:
- Threads can see different values for the same variable
- Memory consistency is only guaranteed at specific points
- Luckily, the default points are usually enough
- Variables can be shared or private to each thread

Writing OpenMP programs

OpenMP directives syntax
In Fortran, through a specially formatted comment:

    sentinel construct [clauses]

where sentinel is one of: !$omp, C$OMP or *$OMP in fixed format; !$omp in free format.
In C/C++, through a compiler directive:

    #pragma omp construct [clauses]

OpenMP syntax is ignored if the compiler does not recognize OpenMP.

We will be using C/C++ syntax throughout this tutorial.

Headers/Macros
C/C++ only:
- omp.h contains the API prototypes and data type definitions
- The _OPENMP macro is defined by OpenMP-enabled compilers; it allows conditional compilation of OpenMP code
Fortran only:
- The omp_lib module contains the subroutine and function definitions
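
A small sketch (not from the slides) of how the _OPENMP macro is typically used to keep a program compilable both with and without OpenMP:

    #include <stdio.h>
    #ifdef _OPENMP
    #include <omp.h>          /* only include the API when OpenMP is enabled */
    #endif

    int main(void)
    {
    #ifdef _OPENMP
        printf("compiled with OpenMP, up to %d threads\n", omp_get_max_threads());
    #else
        printf("compiled without OpenMP, running sequentially\n");
    #endif
        return 0;
    }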

Structured block definition
Most directives apply to a structured block:
- A block of one or more statements
- One entry point, one exit point
- No branching in or out allowed
- Terminating the program is allowed

Creating Threads

The parallel construct
Directive:

    #pragma omp parallel [clauses]
        structured-block

where clauses can be:
- num_threads(expression)
- if(expression)
- shared(var-list) (coming shortly!)
- private(var-list) (coming shortly!)
- firstprivate(var-list) (coming shortly!)
- default(none | shared | private | firstprivate) (the private and firstprivate defaults only exist in Fortran)
- reduction(var-list) (we'll see it later)

The parallel construct: specifying the number of threads
- The number of threads is controlled by an internal control variable (ICV) called nthreads-var
- When a parallel construct is found, a parallel region with a maximum of nthreads-var threads is created
- Parallel constructs can be nested, creating nested parallelism
- nthreads-var can be modified through the omp_set_num_threads API call or the OMP_NUM_THREADS environment variable
- Additionally, the num_threads clause causes the implementation to ignore the ICV and use the value of the clause for that region

The parallel construct: avoiding parallel regions
Sometimes we only want to run in parallel under certain conditions, e.g., enough input data, not already running in parallel, ...
- The if clause allows specifying an expression; when it evaluates to false, the parallel construct will only use 1 thread
- Note that it still creates a new team and data environment

Putting it together
Example:

    void main()
    {
        #pragma omp parallel
        ...                      /* an unknown number of threads here: use OMP_NUM_THREADS */

        omp_set_num_threads(2);
        #pragma omp parallel
        ...                      /* a team of two threads here */

        #pragma omp parallel num_threads(random()%4+1) if (N>=128)
        ...                      /* a team of [1..4] threads here */
    }

API calls
Other useful routines:
- int omp_get_num_threads(): returns the number of threads in the current team
- int omp_get_thread_num(): returns the id of the calling thread within the current team
- int omp_get_num_procs(): returns the number of processors in the machine
- int omp_get_max_threads(): returns the maximum number of threads that will be used in the next parallel region
- double omp_get_wtime(): returns the number of seconds since an arbitrary point in the past
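
A brief sketch (not in the original slides) that exercises these routines, for example to time a parallel region:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("%d processors, at most %d threads in the next region\n",
               omp_get_num_procs(), omp_get_max_threads());

        double start = omp_get_wtime();
        #pragma omp parallel
        {
            printf("thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        printf("parallel region took %f seconds\n", omp_get_wtime() - start);
        return 0;
    }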

Data-sharing attributes

Data environment
A number of clauses are related to building the data environment that the construct will use when executing:
- shared
- private
- firstprivate
- default
- threadprivate
- lastprivate
- reduction

Data-sharing attributes: shared
When a variable is marked as shared, the variable inside the construct is the same as the one outside the construct.
- In a parallel construct this means all threads see the same variable, but not necessarily the same value
- Usually some kind of synchronization is needed to update them correctly
- OpenMP has consistency points at synchronizations

Example:

    int x = 1;
    #pragma omp parallel shared(x) num_threads(2)
    {
        x++;
        printf("%d\n", x);
    }
    printf("%d\n", x);

Prints 2 or 3 (three printfs in total).

Data-sharing attributes: private
When a variable is marked as private, the variable inside the construct is a new variable of the same type with an undefined value.
- In a parallel construct this means all threads have a different variable
- It can be accessed without any kind of synchronization

Example:

    int x = 1;
    #pragma omp parallel private(x) num_threads(2)
    {
        x++;
        printf("%d\n", x);
    }
    printf("%d\n", x);

The printf inside the region can print anything (twice, same or different values); the printf after the region prints 1.

Data-sharing attributes: firstprivate
When a variable is marked as firstprivate, the variable inside the construct is a new variable of the same type, but it is initialized to the original value of the variable.
- In a parallel construct this means all threads have a different variable with the same initial value
- It can be accessed without any kind of synchronization

Example:

    int x = 1;
    #pragma omp parallel firstprivate(x) num_threads(2)
    {
        x++;
        printf("%d\n", x);
    }
    printf("%d\n", x);

The printf inside the region prints 2 (twice); the printf after the region prints 1.

Data-sharing attributes: what is the default?
- Static/global storage is shared
- Heap-allocated storage is shared
- Stack-allocated storage declared inside the construct is private
Others:
- If there is a default clause, what the clause says; none means that the compiler will issue an error if the attribute is not explicitly set by the programmer
- Otherwise, it depends on the construct; for the parallel region the default is shared
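
A small sketch (not from the slides) of how default(none) forces every variable used in the region to be given an explicit data-sharing attribute:

    #include <stdio.h>

    int main(void)
    {
        int n = 4;

        /* every variable referenced inside must now be listed explicitly;
           forgetting shared(n) would be a compile-time error               */
        #pragma omp parallel default(none) shared(n) num_threads(n)
        {
            printf("one of %d threads\n", n);
        }
        return 0;
    }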

Example:

    int x, y;
    #pragma omp parallel private(y)
    {
        x = ...       /* x is shared  */
        y = ...       /* y is private */
        #pragma omp parallel private(x)
        {
            x = ...   /* x is private */
            y = ...   /* y is shared  */
        }
    }

Threadprivate storage
The threadprivate construct:

    #pragma omp threadprivate(var-list)

Can be applied to:
- Global variables
- Static variables
- Class-static members
It allows creating a per-thread copy of global variables.
- Threadprivate storage persists across parallel regions if the number of threads is the same
- Threadprivate persistence across nested regions is complex

Threadprivate storage
Example:

    char *foo()
    {
        static char buffer[BUF_SIZE];
        #pragma omp threadprivate(buffer)

        ...

        return buffer;
    }

- Creates one static copy of buffer per thread
- Now foo can be called by multiple threads at the same time
- foo returns the correct address to its caller

Part II: Worksharing constructs

Outline
- The worksharing concept
- Loop worksharing

The worksharing concept

Worksharings
Worksharing constructs divide the execution of a code region among the threads of a team:
- Threads cooperate to do some work
- A better way to split work than using thread ids
- Lower overhead than using tasks, but less flexible
In OpenMP, there are four worksharing constructs:
- single
- loop worksharing
- section
- workshare
Restriction: worksharings cannot be nested.

Loop worksharing

Loop parallelism: the for construct

    #pragma omp for [clauses]
    for (init-expr; test-expr; inc-expr)

where clauses can be:
- private
- firstprivate
- lastprivate(variable-list)
- reduction(operator:variable-list)
- schedule(schedule-kind)
- nowait
- collapse(n)
- ordered

The for construct: how does it work?
- The iterations of the loop(s) associated with the construct are divided among the threads of the team
- Loop iterations must be independent
- Loops must follow a form that allows computing the number of iterations
- Valid data types for induction variables are: integer types, pointers, and random access iterators (in C++)
- The induction variable(s) are automatically privatized
- The default data-sharing attribute is shared
- It can be merged with the parallel construct: #pragma omp parallel for

Example:

    void foo(int **m, int N, int M)
    {
        int i, j;
        /* the newly created threads cooperate to execute
           all the iterations of the loop                  */
        #pragma omp parallel for private(j)
        for (i = 0; i < N; i++)        /* i is automatically privatized   */
            for (j = 0; j < M; j++)    /* j must be explicitly privatized */
                m[i][j] = 0;
    }

Example:

    void foo(std::vector<int> &v)
    {
        #pragma omp parallel for
        for (std::vector<int>::iterator it = v.begin(); it < v.end(); it++)
            *it = 0;
    }

- Random access iterators (and pointers) are valid types
- != cannot be used in the test expression

Removing dependences
Example:

    x = 0;
    for (i = 0; i < n; i++) {
        v[i] = x;
        x += dx;
    }

Each iteration's x depends on the previous one: this loop can't be parallelized as is.

Example:

    x = 0;
    for (i = 0; i < n; i++) {
        x = i * dx;
        v[i] = x;
    }

But x can be rewritten in terms of i; now the loop can be parallelized.
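
Putting it together, a sketch (not on the original slide) of the parallelized version, with x made private since it is now recomputed in every iteration:

    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        double v[N], dx = 0.5, x;
        int i;

        #pragma omp parallel for private(x)
        for (i = 0; i < N; i++) {
            x = i * dx;        /* no dependence on the previous iteration */
            v[i] = x;
        }

        printf("%f\n", v[N-1]);
        return 0;
    }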

The lastprivate clause
When a variable is declared lastprivate, a private copy is generated for each thread. Then the value of the variable in the last iteration of the loop is copied back to the original variable.
A variable can be both firstprivate and lastprivate.
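
A short sketch (not from the slides) showing lastprivate: after the loop, x holds the value it had in the sequentially last iteration:

    #include <stdio.h>

    int main(void)
    {
        int i, x = -1;

        #pragma omp parallel for lastprivate(x)
        for (i = 0; i < 10; i++)
            x = i * i;

        printf("%d\n", x);   /* prints 81, the value from the last iteration (i == 9) */
        return 0;
    }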

The reduction clause
A very common pattern is one where all threads accumulate some values into a single variable, e.g., n += v[i], our heat program, ...
- Using critical or atomic is not good enough, besides being error-prone and cumbersome
- Instead we can use the reduction clause for basic types
  - Valid operators are: +, -, *, |, ||, &, &&, ^, min, max
  - User-defined reductions are coming soon...
- The compiler creates a private copy that is properly initialized
- At the end of the region, the compiler ensures that the shared variable is properly (and safely) updated
- We can also specify reduction variables in the parallel construct

Example:

    int vector_sum(int n, int v[n])
    {
        int i, sum = 0;
        /* each thread gets a private copy of sum, initialized to the identity value (0) */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < n; i++)
            sum += v[i];
        /* at the end of the region the shared sum is updated with the partial value of each thread */
        return sum;
    }

The schedule clause
The schedule clause determines which iterations are executed by each thread.
- If no schedule clause is present, the schedule is implementation defined
- There are several possible schedule kinds:
  - STATIC
  - STATIC,chunk
  - DYNAMIC[,chunk]
  - GUIDED[,chunk]
  - AUTO
  - RUNTIME

The schedule clause: static schedules
Static schedule: the iteration space is broken into chunks of approximately equal size (number of iterations divided by number of threads); these chunks are then assigned to the threads in a round-robin fashion.
Static,N schedule (interleaved): the iteration space is broken into chunks of size N; these chunks are then assigned to the threads in a round-robin fashion.
Characteristics of static schedules:
- Low overhead
- Good locality (usually)
- Can have load imbalance problems

The schedule clause: dynamic and guided schedules
Dynamic,N schedule: threads dynamically grab chunks of N iterations until all iterations have been executed. If no chunk is specified, N = 1.
Guided,N schedule: a variant of dynamic. The size of the chunks decreases as the threads grab iterations, but it is at least of size N. If no chunk is specified, N = 1.
Characteristics of dynamic schedules:
- Higher overhead
- Not very good locality (usually)
- Can solve imbalance problems

The schedule clause: auto and runtime schedules
Auto schedule: the implementation is allowed to do whatever it wishes. Do not expect much of it as of now.
Runtime schedule: the decision is delayed until the program is run, through the run-sched-var ICV. It can be set with:
- The OMP_SCHEDULE environment variable
- The omp_set_schedule() API call
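
As a small illustration (not part of the slides), a loop with very uneven iteration costs usually benefits from a dynamic schedule; the schedule kind can also be deferred to run time. The helper expensive_work below is a made-up stand-in for imbalanced work:

    #include <stdio.h>

    /* hypothetical helper: cost grows with i, so a plain static schedule would be imbalanced */
    double expensive_work(int i)
    {
        double s = 0.0;
        for (int k = 0; k < i * 1000; k++)
            s += k * 0.5;
        return s;
    }

    int main(void)
    {
        double total = 0.0;

        #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
        for (int i = 0; i < 1000; i++)
            total += expensive_work(i);

        /* with schedule(runtime) instead, the kind is taken from OMP_SCHEDULE,
           e.g.:  export OMP_SCHEDULE="guided,8"                                */
        printf("%f\n", total);
        return 0;
    }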

The nowait clause
When a worksharing construct has a nowait clause, the implicit barrier at the end of the loop is removed.
This allows overlapping the execution of non-dependent loops/tasks/worksharings.

Example:

    #pragma omp for nowait
    for (i = 0; i < n; i++)
        v[i] = 0;

    #pragma omp for
    for (i = 0; i < n; i++)
        a[i] = 0;

The first and second loops are independent, so we can overlap them. (Side note: in this case you would be better off fusing the two loops.)

Example:

    #pragma omp for nowait
    for (i = 0; i < n; i++)
        v[i] = 0;

    #pragma omp for
    for (i = 0; i < n; i++)
        a[i] = v[i] * v[i];

Here the first and second loops are dependent! There is no guarantee that the corresponding iteration of the previous loop has finished.

The nowait clause, exception: static schedules
If the two (or more) loops have the same static schedule and the same number of iterations, corresponding iterations are assigned to the same thread, so the nowait is safe.
Example:

    #pragma omp for schedule(static, M) nowait
    for (i = 0; i < n; i++)
        v[i] = 0;

    #pragma omp for schedule(static, M)
    for (i = 0; i < n; i++)
        a[i] = v[i] * v[i];

The collapse clause
Allows distributing the work of a set of n nested loops:
- Loops must be perfectly nested
- The nest must traverse a rectangular iteration space
Example:

    #pragma omp for collapse(2)
    for (i = 0; i < N; i++)
        for (j = 0; j < M; j++)
            foo(i, j);

The i and j loops are folded and the iterations distributed among all threads. Both i and j are privatized.

Break: coffee time! :-)

Part III: Basic Synchronizations

Outline
- Thread barriers
- Exclusive access

Why synchronization? Mechanisms
Threads need to synchronize to impose some ordering on the sequence of actions of the threads. OpenMP provides different synchronization mechanisms:
- barrier
- critical
- atomic
- taskwait
- ordered
- locks

Thread barriers

Thread barrier
The barrier construct:

    #pragma omp barrier

- Threads cannot proceed past a barrier point until all threads reach the barrier AND all previously generated work is completed
- Some constructs have an implicit barrier at the end, e.g., the parallel construct

Example:

    #pragma omp parallel
    {
        foo();
        #pragma omp barrier    /* forces all foo occurrences to happen before all bar occurrences */
        bar();
    }                          /* implicit barrier at the end of the parallel region              */

Exclusive access

The critical construct:

    #pragma omp critical [(name)]
        structured-block

Provides a region of mutual exclusion where only one thread can be working at any given time.
- By default all critical regions are the same, but you can give them names
- Only those with the same name synchronize with each other

Example:

    int x = 1;
    #pragma omp parallel num_threads(2)
    {
        #pragma omp critical
        x++;               /* only one thread at a time here */
    }
    printf("%d\n", x);

Prints 3!

Example:

    int x = 1, y = 0;
    #pragma omp parallel num_threads(4)
    {
        #pragma omp critical (x)
        x++;
        #pragma omp critical (y)
        y++;
    }

Different names: one thread can update x while another updates y.

The atomic construct:

    #pragma omp atomic
        expression

Provides a special mechanism of mutual exclusion for read & update operations.
- Only supports simple read & update expressions, e.g., x += 1, x = x - foo()
- Only the read & update part is protected; foo() is not protected
- Usually much more efficient than a critical construct
- Not compatible with critical

Example:

    int x = 1;
    #pragma omp parallel num_threads(2)
    {
        #pragma omp atomic
        x++;               /* only one thread at a time updates x here */
    }
    printf("%d\n", x);

Prints 3!

Example:

    int x = 1;
    #pragma omp parallel num_threads(2)
    {
        #pragma omp critical
        x++;
        #pragma omp atomic
        x++;
    }
    printf("%d\n", x);

Critical and atomic do not exclude each other, so different threads can update x at the same time: this prints 3, 4 or 5 :(

Part IV: Practical: OpenMP heat diffusion

Outline
- Heat diffusion


Before you start
Enter the OpenMP directory to do the following exercises.
- Session4.1-exercise contains the serial version of the Heat application
- You can use SsGrind on heat-tareador to determine parallelism and observe differences among the three algorithms, and then annotate the application with OpenMP

Description of the Heat Diffusion app
Hands-on: parallel loops. The file solver.c implements the computation of the heat diffusion.
1. Annotate the jacobi, redblack, and gauss functions with OpenMP (a possible annotation for the jacobi loop is sketched below)
2. Execute the application with different numbers of processors; compare the results and evaluate the performance
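
As a hint (not from the slides; the function signature and variable names are assumptions, since solver.c is not reproduced here), a Jacobi-style relaxation loop is typically annotated like this, with the residual accumulated through a reduction:

    /* hypothetical sketch of a jacobi sweep; u, utmp, sizex, sizey are assumed names */
    double relax_jacobi(double *u, double *utmp, int sizex, int sizey)
    {
        double sum = 0.0;
        int i, j;

        #pragma omp parallel for private(j) reduction(+:sum)
        for (i = 1; i < sizex - 1; i++) {
            for (j = 1; j < sizey - 1; j++) {
                utmp[i*sizey + j] = 0.25 * (u[(i-1)*sizey + j] + u[(i+1)*sizey + j] +
                                            u[i*sizey + (j-1)] + u[i*sizey + (j+1)]);
                double diff = utmp[i*sizey + j] - u[i*sizey + j];
                sum += diff * diff;     /* residual accumulated safely via the reduction */
            }
        }
        return sum;
    }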

Break: bon appétit! :-)

Part V: Programming using a hybrid MPI/OpenMP approach

Outline
- MPI+OpenMP programming


Distributed- vs shared-memory programming
Distributed-memory programming:
- Separate processes
- Private variables are inaccessible from other processes
- Point-to-point and collective communication
- Implicit synchronization
Shared-memory programming:
- Multiple threads share the same address space
- Explicit synchronization

Hybrid programming
Combining MPI+OpenMP:
- Distributed algorithms spread the work over nodes
- Shared memory is used for computation within each node

Opportunities
When to use MPI+OpenMP:
- Starting from OpenMP and moving to clusters with MPI
- Starting from MPI and exploiting further parallelism inside each node
Improvements: OpenMP can solve MPI imbalance.

Alternatives
MPI + computational kernels in OpenMP:
- Use OpenMP directives to exploit parallelism between communication phases
- The OpenMP parallel region ends before new communication calls
MPI inside OpenMP constructs:
- Call MPI from within for-loops or tasks
- MPI needs to support multi-threaded mode
A sketch of the first alternative is shown below.
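
A minimal sketch (not from the slides) of the first alternative: MPI handles the data exchange between processes, while an OpenMP loop does the computation between communication phases. MPI_THREAD_FUNNELED is requested because only the master thread calls MPI; the data and sizes are made up for illustration:

    #include <stdio.h>
    #include <mpi.h>

    #define N 1024

    int main(int argc, char *argv[])
    {
        int provided, rank;
        double local[N], halo = 0.0, sum = 0.0;

        /* only the master thread will call MPI, so FUNNELED support is enough */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < N; i++)
            local[i] = rank + i * 0.001;

        /* communication phase: executed by one thread per process */
        MPI_Allreduce(&local[0], &halo, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        /* computation phase: OpenMP threads inside each MPI process */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += local[i] + halo;

        printf("rank %d: sum = %f\n", rank, sum);
        MPI_Finalize();
        return 0;
    }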

Compiling MPI+OpenMP
The MPI compiler driver needs the proper OpenMP option:
- mpicc -openmp
- mpicc -fopenmp
Also useful:
- mpicc -show <your command line options and files>
It displays the full command line executed by mpicc to compile your program.

Part VI: Practical: MPI+OpenMP heat diffusion

Outline
- MPI+OpenMP Heat diffusion


Before you start
Enter the Session4.2-exercise directory to do the following exercises.

Description of the Heat Diffusion app
Hands-on: parallel loops. The file solver.c implements the computation of the heat diffusion.
1. Use MPI to distribute the work across nodes
2. Annotate the jacobi, redblack, and gauss functions with OpenMP tasks
3. Execute the application with different numbers of nodes/processors, and compare the results