OpenMP. António Abreu. Instituto Politécnico de Setúbal. 1 March 2013


OpenMP: what?

It's an Application Program Interface (API) that allows parallel programs to be developed explicitly and simply, in C/C++, for multi-platform, shared-memory, multiprocessor computers (including Solaris, AIX, HP-UX, GNU/Linux, Mac OS X, and Windows platforms). It is supported by the major computer hardware and software vendors, including AMD, IBM, Intel, Cray, HP, Fujitsu, Nvidia, NEC, Microsoft, Texas Instruments, and Oracle, among others.

Cores and memory

Multicore computers have a memory system in which some memories are shared while others are not. The next figure makes this distinction clear. (TLB stands for Translation Lookaside Buffer, which is an address cache.) When writing parallel programs one must know which memory is shared and which is not.

Fork-join

OpenMP is based on multithreading, i.e., a form of parallelization whereby a master thread forks a specified number of slave threads, with the runtime environment allocating threads to different processors.

How many cores does my machine have?

In Linux, the file /proc/cpuinfo contains a lot of information about the machine's hardware. Typing less /proc/cpuinfo lets you see it all. For information about memory, see the contents of /proc/meminfo; the first number of interest is the one labelled MemTotal.

In order to use OpenMP, one has to have a proper compiler. On Linux, GCC 4.2 or higher supports OpenMP. To see the version of your (Linux) compiler, type the command gcc -v.

parallel directive

    #pragma omp parallel [clause ...] newline
        structured_block

where clause can be:

    if (scalar_expression)
    num_threads (integer-expression)
    private (list)
    shared (list)
    default (shared | none)
    firstprivate (list)
    reduction (operator: list)
    copyin (list)

Hello world

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            int ID = omp_get_thread_num();
            printf("Hello (%d)\n", ID);
            printf("world (%d)\n", ID);
            printf("! (%d)\n", ID);
        }
        return 0;
    }

Compile with gcc -fopenmp hello.c -o hello

    Hello (0)
    world (0)
    ! (0)
    Hello (1)
    world (1)
    ! (1)
    Hello (2)
    world (2)
    ! (2)
    Hello (3)
    world (3)
    ! (3)

The code between the curly brackets (after the pragma directive) executes in a predetermined number of threads. At the opening curly bracket there is a fork, i.e., the master thread creates a team of parallel threads; at the closing curly bracket there is a join, i.e., the master thread continues execution only after all the slave threads have ended. The closing curly bracket thus constitutes a barrier, beyond which only the master thread passes. The number of threads is typically set to the number of cores in the microprocessor; it can be set from the command line with export OMP_NUM_THREADS=4. omp_get_thread_num() is a function that returns the Id of the calling thread. The master thread has Id 0 and is part of the thread team.

We observe an ordered output, but sometimes this may not happen; in fact there is a race condition, because the four threads share the standard output. Note that OpenMP is not necessarily implemented identically by all vendors. Also, it does not provide checks for data dependencies, data conflicts, race conditions, or deadlocks. In particular, it does not guarantee that input or output to the same file is synchronous when executed in parallel; it is up to the programmer to synchronize input and output.
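
To make the point concrete, here is a minimal sketch (not from the original slides) of a data race that OpenMP will not detect: every thread updates the shared variable sum without synchronization, so increments can be lost and the printed result varies between runs.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int sum = 0, i;
        /* All threads perform an unsynchronized read-modify-write on sum. */
        #pragma omp parallel for
        for (i = 0; i < 1000000; i++)
            sum = sum + 1;
        printf("sum = %d (expected 1000000)\n", sum);
        return 0;
    }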

Synchronization Constructs: barrier

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int th_id, nthreads;
        #pragma omp parallel private(th_id)
        {
            th_id = omp_get_thread_num();
            printf("Hello World from thread %d\n", th_id);
            #pragma omp barrier
            if (th_id == 0) {
                nthreads = omp_get_num_threads();
                printf("There are %d threads\n", nthreads);
            }
        }
        return EXIT_SUCCESS;
    }

    Hello World from thread 1
    Hello World from thread 3
    Hello World from thread 0
    Hello World from thread 2
    There are 4 threads

Barriers are a synchronization primitive: all threads in the team wait for the last one to reach the barrier, and at that moment all threads in the team resume execution in parallel. If some thread never reaches the barrier, all the other threads in the team wait forever, and the process hangs without producing any work.

Quiz

If we comment out the barrier pragma in the code above, the output will be:

    Hello World from thread 0
    There are 4 threads
    Hello World from thread 3
    Hello World from thread 1
    Hello World from thread 2

Explain why.

Quiz

If we add the line printf("bye from thread %d\n", th_id); after the if, what would the output be?

Workshare directives: for

    #pragma omp for [clause ...] newline
        for_loop

where clause can be:

    schedule (type [, chunk])
    ordered
    private (list)
    firstprivate (list)
    lastprivate (list)
    shared (list)
    reduction (operator: list)
    collapse (n)
    nowait

parallel for example

    #include <omp.h>

    #define CHUNKSIZE 100
    #define N 1000

    int main()
    {
        int i, chunk = CHUNKSIZE;
        float a[N], b[N], c[N];

        /* Some initializations */
        for (i = 0; i < N; i++)
            a[i] = b[i] = i * 1.0;

        #pragma omp parallel shared(a,b,c,chunk) private(i)
        {
            #pragma omp for schedule(dynamic,chunk) nowait
            for (i = 0; i < N; i++)
                c[i] = a[i] + b[i];
        } /* end of parallel section */

        return 0;
    }

The for pragma asks the compiler to distribute the N iterations of the loop among the threads. The schedule clause informs the runtime system about how to schedule those iterations. In this case the scheduling policy is dynamic, which means that chunks of iterations are assigned to threads on a first-come, first-served basis; here each thread executes chunk (i.e., 100) iterations of the total of 1000 in the loop before requesting more. The nowait clause causes the implied barrier at the end of the for directive to be ignored. Put differently, if there were no such clause, all team threads would stop at the end of the for construct, and only thread 0 would continue past this point.
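
As an illustration (not from the original slides), the following sketch prints which thread executes each iteration; changing the schedule clause, e.g. from (dynamic, 2) to (static, 2), changes the mapping of iterations to threads:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int i;
        omp_set_num_threads(4);  /* assumption: we ask for 4 threads */
        /* Iterations are handed out to threads in chunks of 2. */
        #pragma omp parallel for schedule(dynamic, 2)
        for (i = 0; i < 8; i++)
            printf("iteration %d executed by thread %d\n", i, omp_get_thread_num());
        return 0;
    }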

Quiz

In the following program, which for loop is executed in parallel: the first, or both? Before answering, note that the parallel and for directives are combined into a single one. This is valid.

    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        const int N = 100;
        int i, a[N];

        #pragma omp parallel for
        for (i = 0; i < N; i++)
            a[i] = 2 * i;

        for (i = 0; i < N; i++)
            printf("%d ", a[i]);

        return 0;
    }

Workshare directives: sections

    #pragma omp sections [clause ...] newline
    {
        #pragma omp section newline
            structured_block
        #pragma omp section newline
            structured_block
    }

where clause can be:

    private (list)
    firstprivate (list)
    lastprivate (list)
    reduction (operator: list)
    nowait

section directive example

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel sections
        {
            #pragma omp section
            printf("hello from thread %d\n", omp_get_thread_num());
            #pragma omp section
            printf("hello from thread %d\n", omp_get_thread_num());
            #pragma omp section
            printf("hello from thread %d\n", omp_get_thread_num());
        }
        printf("Bye from thread %d\n", omp_get_thread_num());
        return 0;
    }

A few executions

First execution:

    hello from thread 0
    hello from thread 0
    hello from thread 0
    Bye from thread 0

Second execution:

    hello from thread 0
    hello from thread 1
    hello from thread 3
    Bye from thread 0

Third execution:

    hello from thread 2
    hello from thread 1
    hello from thread 0
    Bye from thread 0

Another example

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int i = 0;
        #pragma omp parallel sections if (i == 1)
        {
            #pragma omp section
            printf("hello from thread %d\n", omp_get_thread_num());
            #pragma omp section
            printf("hello from thread %d\n", omp_get_thread_num());
            #pragma omp section
            printf("hello from thread %d\n", omp_get_thread_num());
        }
        printf("Bye from thread %d\n", omp_get_thread_num());
        return 0;
    }

Unique result

    hello from thread 0
    hello from thread 0
    hello from thread 0
    Bye from thread 0

Since the condition is false, the team of threads is not created, but the master thread remains. Note that the assigned work (three blocks of code) is then executed serially; so the if clause makes it possible to parallelize work or not (i.e., to serialize it), and the decision is made at runtime. Also, there is an implicit barrier at the end of the sections construct. This explains why Bye from... (in the last two examples) is always the last message printed.

Clause reduction

    reduction (operator: list)

When a team of threads is created, the variables in list are created as private to each thread. When the threads in the team finish, operator is applied to the threads' private copies, a process known as reduction, and the final result is written back to the variables in list, now seen again as global shared variables. Variables in list must be scalar; arrays and structures are not allowed.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int t = 0;
        omp_set_num_threads(4);
        #pragma omp parallel reduction(+:t)
        {
            t = omp_get_thread_num() + 1;
            printf("local %d\n", t);
        }
        printf("reduction %d\n", t);
        return 0;
    }

Result

    local 1
    local 2
    local 3
    local 4
    reduction 10

The function of omp_set_num_threads() is self-explanatory. As expected, it cannot be called from a parallelized block of code.

Synchronization Constructs: atomic

Used to identify a memory location that should not be modified simultaneously by more than one thread in the team. In other words, it provides atomic access to the memory location.

    #pragma omp atomic newline
        statement

The directive applies only to the single statement that immediately follows it.
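
As a minimal sketch (not from the original slides): each thread atomically increments a shared counter, so no update is lost.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int counter = 0;
        #pragma omp parallel
        {
            /* The read-modify-write on counter is performed atomically. */
            #pragma omp atomic
            counter += 1;
        }
        printf("counter = %d\n", counter);  /* equals the number of threads */
        return 0;
    }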

Synchronization Constructs: single

Used when there is a block of code that must be executed by a single thread in the team. Note that this by no means implies that the code is made atomic; it may happen that other threads (outside this team) access the same memory locations, thus creating a race condition.

    #pragma omp single [clause[[,] clause] ...] newline
        statement_block

Threads in the team that do not execute this directive wait at the end of the code block, unless a nowait clause is specified.
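
A small sketch (not from the original slides): whichever thread reaches the single block first executes it once, while the others wait at its implicit barrier before continuing.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            /* Executed exactly once, by one (arbitrary) thread. */
            #pragma omp single
            printf("initialization done by thread %d\n", omp_get_thread_num());

            /* All threads continue here after the implicit barrier. */
            printf("working: thread %d\n", omp_get_thread_num());
        }
        return 0;
    }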

Synchronization Constructs: master

Used to identify a block of code that must be executed only by the master thread.

    #pragma omp master newline
        statement_block
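
A minimal sketch (not from the original slides); note that, unlike single, master has no implied barrier at the end of the block:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            /* Only the master thread (Id 0) executes this; the others skip it. */
            #pragma omp master
            printf("master thread %d reporting\n", omp_get_thread_num());
        }
        return 0;
    }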

Synchronization Constructs: critical

Specifies a block of code that must be executed by only one thread at a time. In other words, while one thread is executing inside a critical region, no other thread may execute that region in parallel.

    #pragma omp critical [(name)] newline
        statement_block

Different critical regions with the same name are treated as the same region. All unnamed critical regions are treated as the same region.

Example

    #include <omp.h>

    int main()
    {
        int x = 0;

        #pragma omp parallel shared(x)
        {
            #pragma omp critical
            x = x + 1;
        } /* end of parallel section */

        return 0;
    }

Synchronization Constructs: flush

This directive identifies a point at which a consistent view of memory must exist, i.e., thread-visible variables are written back to memory in response to this directive.

    #pragma omp flush [(list)]

Remember the first figure of these course notes: this directive forces the data in each core's data cache to be written to the shared unified cache (and not necessarily to main memory; that decision is made by the virtual memory system).
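
A classic illustration (a simplified sketch, not from the original slides) is producer/consumer signalling through a flag; flush makes the producer's writes visible to the consumer. A production version would also need atomic accesses to flag, and the sketch assumes at least two threads are available:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int flag = 0, data = 0;
        #pragma omp parallel sections shared(flag, data)
        {
            #pragma omp section
            {   /* producer */
                data = 42;
                #pragma omp flush(data)
                flag = 1;               /* signal the consumer */
                #pragma omp flush(flag)
            }
            #pragma omp section
            {   /* consumer: busy-wait until the flag becomes visible */
                while (1) {
                    #pragma omp flush(flag)
                    if (flag) break;
                }
                #pragma omp flush(data)
                printf("data = %d\n", data);
            }
        }
        return 0;
    }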

OpenMP functions about threads

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("omp_get_max_threads=%d\n", omp_get_max_threads());
        omp_set_num_threads(2);
        printf("omp_get_num_procs=%d\n", omp_get_num_procs());
        #pragma omp parallel
        {
            printf("omp_get_thread_num=%d\n", omp_get_thread_num());
        }
        printf("omp_get_thread_num=%d\n", omp_get_thread_num());
        return 0;
    }

    omp_get_max_threads=4
    omp_get_num_procs=2
    omp_get_thread_num=0
    omp_get_thread_num=1
    omp_get_thread_num=0

omp_get_num_procs() returns the number of processors in the machine.

Synchronization: locks

    omp_lock_t lck;
    int id, tmp;

    omp_init_lock(&lck);
    #pragma omp parallel private(tmp, id)
    {
        id = omp_get_thread_num();
        tmp = do_lots_of_work(id);       /* critical region w.r.t. tmp */

        omp_set_lock(&lck);
        printf("%d %d\n", id, tmp);      /* atomic access to id and tmp */
        omp_unset_lock(&lck);

        tmp = do_more_lots_of_work(id);  /* critical region w.r.t. tmp */
    }
    omp_destroy_lock(&lck);

Bibliography

    Wikipedia
    http://openmp.org
    https://computing.llnl.gov/tutorials/openmp/
    http://msdn.microsoft.com/
    http://publib.boulder.ibm.com