
OpenMP loops Paolo Burgio paolo.burgio@unimore.it

Outline
- Expressing parallelism: understanding parallel threads
- Memory: data management, data clauses
- Synchronization: barriers, locks, critical sections
- Work partitioning: loops, sections, single work, tasks
- Execution devices: target

What we saw so far..
- Threads: how to create and properly manage a team of threads, and how to join them with barriers
- Memory: how to create private and shared variable storage, and how to properly ensure memory consistency among parallel threads
- Data synchronization: how to create locks to implement, e.g., mutual exclusion; how to identify critical sections; how to ensure atomicity on single statements

Work sharing between threads
But.. how can we split an existing workload among parallel threads? Say, a loop. Typical scenario:
1. Analyze sequential code from customer/boss
2. Parallelize it with OpenMP (for a "generic" parallel machine)
3. Tune num_threads for the specific machine
4. Get money/congratulations from customer/boss
Might not be as easy as with PI Montecarlo! How do we do step 2 without rewriting/re-engineering the code?

Exercise Let's code!
Create an array of N elements. Put inside each array element its index, multiplied by '2': arr[0] = 0; arr[1] = 2; arr[2] = 4; ...and so on..
Now, do it in parallel with a team of T threads. Say, N = 19, with T = 3 or T = 4.
Hint: act on the boundaries of the loop.
Hint #2: omp_get_thread_num(), omp_get_num_threads()
Example (N = 19):
T = 3: threads execute iterations 0..6, 7..13, 14..18
T = 4: threads execute iterations 0..4, 5..9, 10..14, 15..18

Loop partitioning among threads Let's code!
Case #1: N multiple of T. Say, N = 20, T = 4. chunk = #iterations for each thread. Very simple..
Thread 0 gets iterations 0..4, thread 1 gets 5..9, thread 2 gets 10..14, thread 3 gets 15..19 (chunk = 5).
chunk = N / T
i_start = thread_ID * chunk
i_end = i_start + chunk

Loop partitioning among threads Let's code!
Case #2: N not multiple of T. Say, N = 19, T = 4. chunk = #iterations for each thread (but the last). The last thread has less! (chunk_last)
Thread 0 gets iterations 0..4, thread 1 gets 5..9, thread 2 gets 10..14, thread 3 gets 15..18 (chunk = 5, chunk_last = 4).
chunk = N / T + 1 (integer division)
chunk_last = N % chunk
i_start = thread_ID * chunk
i_end = i_start + chunk (if not the last thread), i_start + chunk_last (if the last thread)

"Last thread" Unfortunately, we don't know which thread will be "last" in time But we don't actually care the order in which iterations are executed If there are not depenencies.. And..we do know that 0 <= omp_get_thread_num()< omp_get_num_threads() We choose that last thread as highest number Parallel Programming LM 2016/17 8

Let's put them together!
Case #1 (N multiple of T):
chunk = N / T
i_start = thread_ID * chunk; i_end = i_start + chunk
Case #2 (N not multiple of T):
chunk = N / T + 1; chunk_last = N % chunk
i_start = thread_ID * chunk
i_end = i_start + chunk (if not the last thread), i_start + chunk_last (if the last thread)
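
Putting it into code, a minimal sketch of the manual partitioning (my own illustration, not from the slides; N = 19 and the team of 4 threads follow the example above, so the Case #2 formulas apply):

#include <stdio.h>
#include <omp.h>

#define N 19

int main(void)
{
    int arr[N];

    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        int nthr = omp_get_num_threads();

        /* Case #2: N is not a multiple of nthr */
        int chunk = N / nthr + 1;       /* 19/4 + 1 = 5 */
        int chunk_last = N % chunk;     /* 19 % 5 = 4 */
        int i_start = tid * chunk;
        int i_end = (tid == nthr - 1) ? i_start + chunk_last
                                      : i_start + chunk;

        for (int i = i_start; i < i_end; i++)
            arr[i] = i * 2;             /* index, multiplied by 2 */
    }

    for (int i = 0; i < N; i++)
        printf("arr[%d] = %d\n", i, arr[i]);
    return 0;
}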

Work sharing constructs
A way to distribute work among parallel threads, in a simple and "elegant" manner, using pragmas. OpenMP was born for this.
OpenMP 2.5 targets regular, loop-based parallelism. OpenMP 3.x targets irregular/dynamic parallelism (we will see it later).

The for construct
#pragma omp for [clause [[,] clause]...] new-line
    for-loops
Where clauses can be:
- private(list)
- firstprivate(list)
- lastprivate(list)
- linear(list[ : linear-step])
- reduction(reduction-identifier : list)
- schedule([modifier [, modifier]:] kind[, chunk_size])
- collapse(n)
- ordered[(n)]
- nowait
The iterations will be executed in parallel by the threads in the team. The iterations are distributed across the threads executing the parallel region to which the loop region binds. for-loops must have the canonical loop form.

Canonical loop form
for (init-expr; test-expr; incr-expr)
    structured-block
init-expr, test-expr and incr-expr must not be empty. This eases programmers' life and is more structured: recommended also for "sequential programmers". Preferable to while and do..while, if possible.
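
A minimal illustration of the contrast (condition() and next() are hypothetical helpers, only here for the example):

/* Canonical form: init, test and increment are explicit, so the
   runtime can compute the iteration count up front */
for (int i = 0; i < N; i++) { /* ... */ }

/* NOT canonical: no explicit init-expr/test-expr/incr-expr */
int i = 0;
while (condition(i))
    i = next(i);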

Exercise Let's code!
Create an array of N elements. Put inside each array element its index, multiplied by '2': arr[0] = 0; arr[1] = 2; arr[2] = 4; ...and so on..
Now, do it in parallel with a team of threads, using the for construct.
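
A possible solution sketch (my own, under the same assumptions as before: N = 19, 4 threads). Note that the manual index arithmetic disappears:

#include <stdio.h>
#include <omp.h>

#define N 19

int main(void)
{
    int arr[N];

    #pragma omp parallel num_threads(4)
    {
        /* the runtime splits the N iterations among the team */
        #pragma omp for
        for (int i = 0; i < N; i++)
            arr[i] = i * 2;
    } /* (implicit) barrier */

    for (int i = 0; i < N; i++)
        printf("arr[%d] = %d\n", i, arr[i]);
    return 0;
}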

Data sharing clauses
#pragma omp for [clause [[,] clause]...] new-line
    for-loops
Where clauses can be:
- private(list)
- firstprivate(list)
- lastprivate(list)
- linear(list[ : linear-step])
- reduction(reduction-identifier : list)
- schedule([modifier [, modifier]:] kind[, chunk_size])
- collapse(n)
- ordered[(n)]
- nowait
first/private and reduction we already know: private storage, with or without initialization. linear we won't cover.

The lastprivate clause
A list item that appears in a lastprivate clause is subject to the private clause semantics. In addition, after the construct the original variable is updated with the value from the sequentially last iteration of the associated loops.

lastprivate variables and memory
Like private, lastprivate creates a new storage for the variable, local to each thread:

int a = 11;
#pragma omp parallel for lastprivate(a) num_threads(4)
for (...) {
    a = ...
}

Each thread writes its own copy of a on its stack, while the original storage in process memory still holds 11. When the loop ends, the private copies are discarded and the original a is updated with the value written by the sequentially last iteration.
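
A minimal runnable sketch of these semantics (my own example; the loop bound 9 is arbitrary):

#include <stdio.h>

int main(void)
{
    int a = 11;

    #pragma omp parallel for lastprivate(a) num_threads(4)
    for (int i = 0; i < 9; i++)
        a = i;   /* each thread writes its private copy */

    /* a holds the value from the sequentially last iteration (i == 8),
       no matter which thread executed it */
    printf("a = %d\n", a);   /* prints 8 */
    return 0;
}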

Exercise Let's code!
Modify the "PI Montecarlo" exercise to use the for construct. Up to now, each thread executes its "own" loop, with i from 0 to 2499. Using the for construct, the threads actually share the loop: no need to modify the boundaries!!! Check it with printf.
Without the for construct, every thread runs iterations 0..2499. With it, the 10000 iterations are split among the 4 threads: 0..2499, 2500..4999, 5000..7499, 7500..9999.
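
A hedged sketch of what the modified exercise could look like (the original exercise code is not shown here, so the rand_r()-based point generation, the per-thread seed and NPOINTS = 10000 are my assumptions for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NPOINTS 10000

int main(void)
{
    int hits = 0;

    #pragma omp parallel num_threads(4)
    {
        unsigned seed = 1234 + omp_get_thread_num();  /* per-thread seed */

        /* all threads share ONE loop of NPOINTS iterations:
           no need to touch the boundaries by hand */
        #pragma omp for reduction(+ : hits)
        for (int i = 0; i < NPOINTS; i++) {
            double x = (double) rand_r(&seed) / RAND_MAX;
            double y = (double) rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0)
                hits++;
        }
    }

    printf("pi ~= %f\n", 4.0 * hits / NPOINTS);
    return 0;
}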

Exercise Let's code!
Create an array of N elements. Put inside each array element its index, multiplied by '2': arr[0] = 0; arr[1] = 2; arr[2] = 4; ...and so on..
Declare the array as lastprivate, so you can print its values after the parreg (parallel region), in the sequential zone. Do this at home.

OpenMP 2.5
OpenMP provides three work-sharing constructs:
- Loops
- Single
- Sections

The single construct
#pragma omp single [clause [[,] clause]...] new-line
    structured-block
Where clauses can be:
- private(list)
- firstprivate(list)
- copyprivate(list)
- nowait
The enclosed block is executed by only one thread in the team. ..and what about the other threads?
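
A minimal sketch (my own example): one thread, not necessarily the master, executes the block; the others skip it and wait at the implicit barrier at its end.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel num_threads(4)
    {
        #pragma omp single
        printf("thread %d initializes shared data\n",
               omp_get_thread_num());
        /* (implicit) barrier */

        printf("thread %d goes on\n", omp_get_thread_num());
    }
    return 0;
}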

Worksharing constructs and barriers
Each worksharing construct has an implicit barrier at its end. Example: a loop. If one thread is delayed, it prevents the other threads from doing useful work!!

#pragma omp parallel num_threads(4)
{
    #pragma omp for
    for(int i=0; i<N; i++) { ... }
    // (implicit) barrier

    // USEFUL WORK!!
} // (implicit) barrier

The nowait clause in the for construct
#pragma omp for [clause [[,] clause]...] new-line
    for-loops
The nowait clause removes the barrier at the end of a worksharing (WS) construct. It applies to all of the WS constructs. It does not apply to parregs!

Worksharing constructs and barriers
With nowait, the barrier at the end of the WS construct is removed. Still, there is a barrier at the end of the parreg.

#pragma omp parallel num_threads(4)
{
    #pragma omp for nowait
    for(int i=0; i<N; i++) { ... }
    // no barrier

    // USEFUL WORK!!
} // (implicit) barrier

The sections construct
#pragma omp sections [clause[ [,] clause]... ] new-line
{
    [#pragma omp section new-line]
        structured-block
    [#pragma omp section new-line]
        structured-block
    ...
}
Where clauses can be:
- private(list)
- firstprivate(list)
- lastprivate(list)
- reduction(reduction-identifier : list)
- nowait
Each section contains code that is executed by a single thread: a "switch" for threads. The clauses we already know.. lastprivate items are updated by the section that executes last (in time).
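
A minimal sketch (my own example): each section is executed once, by some thread of the team, and different sections can run in parallel.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel num_threads(4)
    {
        #pragma omp sections
        {
            #pragma omp section
            printf("section A, by thread %d\n", omp_get_thread_num());

            #pragma omp section
            printf("section B, by thread %d\n", omp_get_thread_num());
        } /* (implicit) barrier */
    }
    return 0;
}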

Sections vs. loops
Loops implement the data-parallel paradigm: the same work, on different data. Aka: data decomposition, SIMD, SPMD.
Sections implement the task-based paradigm: different work, on the same or different data. Aka: task decomposition, MPSD, MPMD.

The master construct
#pragma omp master new-line
    structured-block
No clauses. The structured block is executed only by the master thread. "Similar" to the single construct, but it is not a work-sharing construct: there is no barrier implied!!
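
A minimal sketch (my own example) contrasting it with single: only thread 0 executes the block, and the other threads do not wait.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel num_threads(4)
    {
        #pragma omp master
        printf("only the master (thread %d) prints this\n",
               omp_get_thread_num());
        /* NO barrier here: the other threads run ahead */
    }
    return 0;
}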

Combined parreg + WS constructs
For each WS construct, there is also a compact form. In this case, clauses for both constructs apply.

#pragma omp parallel
{
    #pragma omp for
    for(int i=0; i<N; i++) { ... }
} // (implicit) barrier

        =

#pragma omp parallel
#pragma omp for
for(int i=0; i<N; i++) { ... }
// (implicit) barrier

        =

#pragma omp parallel for
for(int i=0; i<N; i++) { ... }
// (implicit) barrier

How to run the examples Let's code!
Download the Code/ folder from the course website.
Compile: $ gcc -fopenmp code.c -o code
Run (Unix/Linux): $ ./code
Run (Win/Cygwin): $ ./code.exe

References "Calcolo parallelo" website http://algogroup.unimore.it/people/marko/courses/programmazione_parallela/ My contacts paolo.burgio@unimore.it http://hipert.mat.unimore.it/people/paolob/ OpenMP specifications http://www.openmp.org http://www.openmp.org/mp-documents/openmp-4.5.pdf http://www.openmp.org/mp-documents/openmp-4.5-1115-cpp-web.pdf A lot of stuff around https://computing.llnl.gov/tutorials/openmp/ Parallel Programming LM 2016/17 29