
OpenMP Overview in 30 Minutes
Christian Terboven <terboven@rz.rwth-aachen.de>
06.12.2010 / Aachen, Germany
As of: 03.12.2010, Version 2.3
Rechen- und Kommunikationszentrum (RZ)

Agenda
OpenMP: Parallel Regions, Worksharing, Synchronization
Example: Pi
OpenMP: Tasking
Example: Fibonacci

OpenMP: Parallel Regions, Worksharing, Synchronization

OpenMP Overview (1/2)
OpenMP is a Shared-Memory Parallel Programming Model.
All processors/cores access a shared main memory.
[Figure: four processors, each with its own cache, connected via a crossbar/bus to the shared main memory]
Real architectures are more complex than this, as you have just seen.
Parallelization in OpenMP employs threads.

OpenMP Overview (2/2)
[Figure: fork-join execution model – serial parts are executed by the Master Thread alone; at a Parallel Region, Worker (Slave) Threads join in]
OpenMP programs start with just one thread: the Master.
Worker threads are spawned at Parallel Regions. Together with the Master they form a Team.
In between Parallel Regions the Worker threads are put to sleep.
Concept: Fork-Join. Allows for an incremental parallelization!

Directives and Structured Blocks
The parallelism has to be expressed explicitly.

C/C++
#pragma omp parallel
{
   ... structured block ...
}

Structured Block:
Exactly one entry point at the top
Exactly one exit point at the bottom
Branching in or out is not allowed
Terminating the program is allowed (abort)

Specification of the number of threads:
Environment variable: OMP_NUM_THREADS=...
Or via the num_threads clause: #pragma omp parallel num_threads(num)
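
For illustration (not part of the original slides), a minimal, complete program combining the directive with both ways of setting the thread count might look like this; the omp_get_thread_num() call used for the output is introduced later in the Runtime Library section. Compiled, e.g., with gcc -fopenmp.

C/C++
#include <stdio.h>
#include <omp.h>

int main(void)
{
   /* Fork: a team of threads executes the structured block.          */
   /* Run with OMP_NUM_THREADS=4, or fix the count via the clause.    */
   #pragma omp parallel num_threads(4)
   {
      printf("Hello from thread %d\n", omp_get_thread_num());
   }  /* Join: only the Master continues after the Parallel Region.   */
   return 0;
}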

Worksharing (1/2)
If only the parallel construct is used, each thread executes the complete Structured Block. Program speedup therefore comes from Worksharing.
OpenMP's most common Worksharing construct: for

C/C++
int i;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
   a[i] = b[i] + c[i];
}

Distribution of loop iterations over all threads in a Team.
Scheduling of the distribution can be influenced.
Loops often account for most of the program runtime!

Worksharing (2/2)
for-construct: OpenMP allows you to influence how the iterations are scheduled among the threads of the team, via the schedule clause:
schedule(static [, chunk]): Iteration space divided into blocks of chunk size, blocks are assigned to threads in a round-robin fashion. If chunk is not specified: one block per thread.
schedule(dynamic [, chunk]): Iteration space divided into blocks of chunk size (default: 1), blocks are handed out to threads in the order in which threads finish their previous blocks.
schedule(guided [, chunk]): Similar to dynamic, but the block size starts with an implementation-defined value and is then decreased exponentially down to chunk.
The default on most implementations is schedule(static).
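
As a sketch (added here, not on the slide) of how the clause is written, assuming arrays a[], b[], c[] and a function work() whose cost varies per iteration:

C/C++
/* Even iteration cost: static, chunks of 10, assigned round-robin. */
#pragma omp parallel for schedule(static, 10)
for (int i = 0; i < 100; i++)
   a[i] = b[i] + c[i];

/* Varying iteration cost: dynamic, a thread grabs the next chunk of 4 when it is done. */
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < 100; i++)
   a[i] = work(i);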

Scoping (1/2)
Challenge of Shared-Memory parallelization: managing the Data Environment.
Scoping in OpenMP: dividing variables into shared and private:
private-list and shared-list on the Parallel Region
private-list and shared-list on Worksharing constructs
Default is shared
Loop control variables on for-constructs are private
Non-static variables local to Parallel Regions are private
Static variables are shared
private: A new, uninitialized instance is created for each thread
firstprivate: Initialization with the Master's value
lastprivate: The value of the last loop iteration is written back to the Master
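
A short illustrative fragment (not from the slides) contrasting the three clauses; the array a[] and the variables t and last are assumed to be declared in the enclosing scope:

C/C++
int t = 42, last = -1;

#pragma omp parallel for private(t)          /* each thread gets a new, uninitialized t              */
for (int i = 0; i < 100; i++)
   { t = 2 * i; a[i] = t; }

#pragma omp parallel for firstprivate(t)     /* each private copy starts with the Master's value 42  */
for (int i = 0; i < 100; i++)
   a[i] = t + i;

#pragma omp parallel for lastprivate(last)   /* after the loop, last holds the value of iteration 99 */
for (int i = 0; i < 100; i++)
   last = a[i];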

Scoping (2/2)
Global / static variables can be privatized with the threadprivate directive:
One instance is created for each thread, before the first parallel region is encountered
The instance exists until the program ends
Does not work (well) with nested Parallel Regions
Based on thread-local storage (TLS): TlsAlloc (Win32 threads), pthread_key_create (POSIX threads), keyword __thread (GNU extension)

C/C++
static int i;
#pragma omp threadprivate(i)
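
A sketch (an addition, with an illustrative variable name) of how threadprivate is typically used for a per-thread counter that survives across Parallel Regions; the copyin clause initializes every copy from the Master's value:

C/C++
static int counter = 0;
#pragma omp threadprivate(counter)

void do_work(void)
{
   #pragma omp parallel copyin(counter)   /* all copies start from the Master's value */
   {
      counter++;                          /* each thread increments its own instance  */
   }
   /* the per-thread values persist until the next Parallel Region */
}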

Synchronization (1/4)
Can all loops be parallelized with for-constructs? No!
Simple test: if the results differ when the loop is executed backwards, the loop iterations are not independent. BUT: this test alone is not sufficient:

C/C++
int i;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
   s = s + a[i];
}

Data Race: If, between two synchronization points, at least one thread writes to a memory location from which at least one other thread reads, the result is not deterministic (race condition).

Synchronization (2/4)
Pseudo-code, here with 4 threads: the sequential loop

do i = 0, 99
   s = s + a(i)
end do

is split into four concurrent partial loops:

Thread 0:  do i = 0, 24:   s = s + a(i)
Thread 1:  do i = 25, 49:  s = s + a(i)
Thread 2:  do i = 50, 74:  s = s + a(i)
Thread 3:  do i = 75, 99:  s = s + a(i)

All four threads update the same variable s in the shared memory; the array a(0) ... a(99) is shared as well.

Synchronization (3/4)
A Critical Region is executed by all threads, but only by one thread at a time (Mutual Exclusion).

C/C++
#pragma omp critical (name)
{
   ... structured block ...
}

Do you think this solution scales well?

C/C++
int i;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
   #pragma omp critical
   s = s + a[i];
}
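
No: every single addition is serialized. A better-scaling variant (sketched here for comparison; the reduction clause on the next slide is the idiomatic solution) accumulates into a private partial sum and enters the Critical Region only once per thread:

C/C++
double s = 0.0;
#pragma omp parallel
{
   double s_local = 0.0;           /* private partial sum, no contention */
   #pragma omp for
   for (int i = 0; i < 100; i++)
      s_local = s_local + a[i];
   #pragma omp critical
   s = s + s_local;                /* one critical entry per thread      */
}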

Synchronization (4/4): Reductions
In a reduction operation the operator is applied to all variables in the list. The variables have to be shared.
reduction(operator:list)
The result is provided in the associated reduction variable.

C/C++
#pragma omp parallel for reduction(+:s)
for (i = 0; i < 100; i++)
{
   s = s + a[i];
}

Possible reduction operators: +, *, -, &, |, ^, &&, ||

Runtime Library
C and C++: If OpenMP is enabled during compilation, the preprocessor symbol _OPENMP is defined. To use the OpenMP runtime library, the header omp.h has to be included.
omp_set_num_threads(int): The specified number of threads will be used for the parallel region encountered next.
int omp_get_num_threads(): Returns the number of threads in the current team.
int omp_get_thread_num(): Returns the number of the calling thread in the team; the Master always has id 0.
Additional functions are available, e.g. to provide locking functionality.
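
A minimal usage sketch of the routines listed above; guarding the calls with _OPENMP keeps the program compilable even when OpenMP is disabled:

C/C++
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
#ifdef _OPENMP
   omp_set_num_threads(4);               /* used for the parallel region encountered next */
#endif
   #pragma omp parallel
   {
      int tid = 0, nthreads = 1;
#ifdef _OPENMP
      tid      = omp_get_thread_num();   /* the Master always has id 0             */
      nthreads = omp_get_num_threads();  /* number of threads in the current team  */
#endif
      printf("Thread %d of %d\n", tid, nthreads);
   }
   return 0;
}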

Example: Pi

Example: Pi
Simple example: calculate Pi by integration: ∫₀¹ 4 / (1 + x²) dx = π

C/C++
double f(double x)
{
   return (double)4.0 / ((double)1.0 + (x*x));
}

void computepi()
{
   double h = (double)1.0 / (double)inumintervals;
   double sum = 0, x;
   for (int i = 1; i <= inumintervals; i++)
   {
      x = h * ((double)i - (double)0.5);
      sum += f(x);
   }
   mypi = h * sum;
}
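
The slide shows only the serial kernel. A plausible parallel version (a sketch, assuming the same global inumintervals and mypi as above) needs nothing more than the worksharing and reduction constructs from the first part; x is declared inside the loop so that it is automatically private:

C/C++
void computepi_parallel(void)
{
   double h   = 1.0 / (double)inumintervals;
   double sum = 0.0;
   #pragma omp parallel for reduction(+:sum)
   for (int i = 1; i <= inumintervals; i++)
   {
      double x = h * ((double)i - 0.5);   /* midpoint of the i-th interval */
      sum += f(x);
   }
   mypi = h * sum;
}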

OpenMP: Tasking

How to parallelize a While-loop?
How would you parallelize this code?

C/C++
typedef list<double> dlist;
dlist mylist;
/* fill mylist with tons of items */
dlist::iterator it = mylist.begin();
while (it != mylist.end())
{
   *it = processlistitem(*it);
   it++;
}

One possibility: Create a fixed-sized array containing all list items and run a parallel loop over this array.
Concept: Inspector / Executor (see the sketch below).
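
A sketch of this Inspector / Executor idea (added for illustration, requires <vector>): the inspector walks the list once and collects pointers into a random-access array, the executor is then an ordinary worksharing loop:

C/C++
/* Inspector: gather pointers to all list items. */
std::vector<double*> items;
for (dlist::iterator it = mylist.begin(); it != mylist.end(); ++it)
   items.push_back(&(*it));

/* Executor: parallel loop over the gathered pointers. */
#pragma omp parallel for
for (int i = 0; i < (int)items.size(); i++)
   *items[i] = processlistitem(*items[i]);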

How to parallelize a While-loop!
Or: Use Tasking, new in OpenMP 3.0:

C/C++
#pragma omp parallel
{
   #pragma omp single
   {
      dlist::iterator it = mylist.begin();
      while (it != mylist.end())
      {
         #pragma omp task
            *it = processlistitem(*it);
         it++;
      }
   }
}

All while-loop iterations are independent from each other!

The task directive

C/C++
#pragma omp task [clause [[,] clause] ... ]
{
   ... structured block ...
}

Each encountering thread creates a new Task:
Code and data are packaged up
Tasks can be nested: into another task directive, or into a Worksharing construct
Data scoping clauses:
shared(list)
private(list)
firstprivate(list)
default(shared | none)
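
A small illustration (not from the slides) of the scoping clauses on a task, assuming a shared array a[] and a placeholder function heavy_work(); firstprivate(i) — which would also be the default here, as explained two slides further on — makes each task remember the index it was created for:

C/C++
#pragma omp parallel
{
   #pragma omp single
   {
      for (int i = 0; i < 100; i++)
      {
         #pragma omp task firstprivate(i) shared(a)
            a[i] = heavy_work(a[i]);   /* heavy_work() is a placeholder */
      }
   }
}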

Task synchronization (1/2)
At an OpenMP barrier (implicit or explicit): all tasks created by any thread of the current Team are guaranteed to be completed at barrier exit.
Task barrier: taskwait
The encountering Task suspends until its child tasks are complete.
Only direct children, not descendants!

C/C++
#pragma omp taskwait

Task synchronization (2/2)
Simple example of Task synchronization in OpenMP 3.0:

C/C++
#pragma omp parallel num_threads(np)
{
   #pragma omp task
      function_a();       // np Tasks created here, one for each thread

   #pragma omp barrier    // all Tasks guaranteed to be completed here

   #pragma omp single
   {
      #pragma omp task
         function_b();    // 1 Task created here
   }                      // B-Task guaranteed to be completed here
}

Tasks in OpenMP: Data Scoping
Some rules from Parallel Regions apply:
Static and global variables are shared
Automatic storage (local) variables are private
If no default clause is given:
Variables of orphaned Tasks are firstprivate by default!
Variables of non-orphaned Tasks inherit the shared attribute!
Variables are firstprivate unless shared in the enclosing context
So far no verification tool is available to check Tasking programs for correctness!
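
A tiny example (an addition) of the firstprivate-by-default rule; it is exactly why the Fibonacci versions on the next slides need shared(x) and shared(y):

C/C++
#pragma omp parallel
{
   #pragma omp single
   {
      int x = 0;                  /* local to the block, i.e. not shared in the enclosing context  */

      #pragma omp task            /* x is firstprivate: the task writes its own copy, result lost  */
         x = 42;

      #pragma omp task shared(x)  /* x is shared: the new value is visible after the taskwait      */
         x = 42;

      #pragma omp taskwait
      /* here x == 42, but only because of the second task */
   }
}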

Example: Fibonacci

Recursive approach to compute Fibonacci

C/C++
int main(int argc, char* argv[])
{
   [...]
   fib(input);
   [...]
}

int fib(int n)
{
   if (n < 2) return n;
   int x = fib(n - 1);
   int y = fib(n - 2);
   return x + y;
}

On the following slides we will discuss three approaches to parallelize this recursive code with Tasking.

First version parallelized with Tasking (omp-v1)

C/C++
int main(int argc, char* argv[])
{
   [...]
   #pragma omp parallel
   #pragma omp single
   fib(input);
   [...]
}

int fib(int n)
{
   if (n < 2) return n;
   int x, y;
   #pragma omp task shared(x)
      x = fib(n - 1);
   #pragma omp task shared(y)
      y = fib(n - 2);
   #pragma omp taskwait
   return x + y;
}

Only one Task / Thread enters fib() from main(); it is responsible for creating the two initial work tasks.
The taskwait is required, as otherwise x and y would be lost.

Scalability measurements (1/3)
Overhead of task creation prevents better scalability!
[Figure: Speedup of Fibonacci with Tasks – omp-v1 vs. optimal, for 1, 2, 4 and 8 threads]

Improved parallelization with Tasking (omp-v2)
Improvement: don't create yet another task once a certain (small enough) n is reached.

C/C++
int main(int argc, char* argv[])
{
   [...]
   #pragma omp parallel
   #pragma omp single
   fib(input);
   [...]
}

int fib(int n)
{
   if (n < 2) return n;
   int x, y;
   #pragma omp task shared(x) if(n > 30)
      x = fib(n - 1);
   #pragma omp task shared(y) if(n > 30)
      y = fib(n - 2);
   #pragma omp taskwait
   return x + y;
}

Scalability measurements (2/3)
Speedup is OK, but we still have some overhead when running with 4 or 8 threads.
[Figure: Speedup of Fibonacci with Tasks – omp-v1 and omp-v2 vs. optimal, for 1, 2, 4 and 8 threads]

Improved parallelization with Tasking (omp-v3)
Improvement: skip the OpenMP overhead entirely once a certain n is reached (not an issue with production compilers).

C/C++
int main(int argc, char* argv[])
{
   [...]
   #pragma omp parallel
   #pragma omp single
   fib(input);
   [...]
}

int fib(int n)
{
   if (n < 2) return n;
   if (n <= 30) return serfib(n);
   int x, y;
   #pragma omp task shared(x)
      x = fib(n - 1);
   #pragma omp task shared(y)
      y = fib(n - 2);
   #pragma omp taskwait
   return x + y;
}
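
serfib() is not shown on the slides; a plausible serial fallback (an assumption, kept deliberately simple) is just the original recursion without any directives:

C/C++
int serfib(int n)
{
   if (n < 2) return n;
   return serfib(n - 1) + serfib(n - 2);   /* no tasking below the cut-off */
}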

Scalability measurements (3/3)
Everything is OK now.
[Figure: Speedup of Fibonacci with Tasks – omp-v1, omp-v2 and omp-v3 vs. optimal, for 1, 2, 4 and 8 threads]

The End
Thank you for your attention.