Distributed Systems + Middleware: Concurrent Programming with OpenMP


Distributed Systems + Middleware: Concurrent Programming with OpenMP. Gianpaolo Cugola, Dipartimento di Elettronica e Informazione, Politecnico di Milano, Italy. cugola@elet.polimi.it http://home.dei.polimi.it/cugola http://corsi.dei.polimi.it/distsys

References: Using OpenMP: Portable Shared Memory Parallel Programming, by Barbara Chapman, Gabriele Jost, and Ruud van der Pas. OpenMP.org: http://openmp.org/. OpenMP Tutorial: https://computing.llnl.gov/tutorials/openmp/

OpenMP in general. OpenMP stands for Open Multi Processing. It's an API for multi-platform, shared memory, parallel programming, available on Linux, MS Windows, and Mac OS X, and offered for the Fortran and C/C++ languages. It is based on compiler directives, library routines, and environment variables. History: the OpenMP Architecture Review Board (ARB) published its first API specification, OpenMP for Fortran 1.0, in October 1997; in October of the following year it released the C/C++ standard. The current version is 4.5, released in November 2015 (most compilers adopt the older 4.0 from July 2013).

The basics. OpenMP is an implementation of multithreading: a master thread "forks" a specified number of slave threads, tasks are divided among the slaves, and the slaves run concurrently as the runtime environment allocates threads to different cores/processors.

Most OpenMP constructs are compiler directives or pragmas. The focus of OpenMP is to parallelize loops. OpenMP offers an incremental approach to parallelism.

Overview of OpenMP. A compiler directive in C/C++ is called a pragma (pragmatic information). It is a preprocessor directive, thus it is declared with a hash (#). Compiler directives specific to OpenMP in C/C++ are written in code as follows: #pragma omp <rest of pragma>

The OpenMP programming model. Based on compiler directives. Nested parallelism support: the API allows parallel constructs inside other parallel constructs. Dynamic threads: the API allows the number of threads that execute different parallel regions to be changed dynamically. Input/Output: OpenMP specifies nothing about parallel I/O; the programmer has to ensure that I/O is conducted correctly within the context of a multi-threaded program. Memory consistency: threads can "cache" their data and are not required to maintain exact consistency with real memory all of the time; when it is critical that all threads view a shared variable identically, the programmer is responsible for ensuring that variables are FLUSHed by all threads as needed.
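
A minimal sketch of nested parallelism (not from the slides): the inner parallel directive creates a new team inside each outer thread when nesting is enabled; whether extra threads are actually created depends on the implementation and on settings such as OMP_NESTED.

#include <omp.h>
#include <cstdio>

int main() {
  omp_set_nested(1);                      // allow nested parallel regions
  #pragma omp parallel num_threads(2)
  {
    int outer = omp_get_thread_num();
    #pragma omp parallel num_threads(2)   // each outer thread becomes master of an inner team
    {
      printf("outer thread %d, inner thread %d\n", outer, omp_get_thread_num());
    }
  }
  return 0;
}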

Hello world in C++

#include <iostream>
using namespace std;

int main() {
  cout << "Hello World\n";
}

$ ./hello
Hello World

Hello world in OpenMP

#include <omp.h>
#include <iostream>
using namespace std;

int main() {
  #pragma omp parallel num_threads(3)
  {
    cout << "Hello World\n";
    // what if
    // cout << "Hello World" << endl;
  }
}

$ ./hello
Hello World
Hello World
Hello World

Compiling an OpenMP program. OpenMP consists of a set of directives and macros that are translated into C/C++ and thread instructions. To compile a program using OpenMP, we need to add the flag -fopenmp: g++ hello.cc -fopenmp -o hello. The command compiles the source hello.cc using the OpenMP library and generates the executable named hello.

OpenMP directives. A valid OpenMP directive must appear after #pragma omp, where name identifies an OpenMP directive. The directive name can be followed by optional clauses (in any order). An OpenMP directive precedes the structured block which is enclosed by the directive itself.

#pragma omp name [clause, ...]
{
  // structured block
}

Example:

#pragma omp parallel num_threads(3)
{
  cout << "Hello World\n";
}

The parallel directive. Plays a crucial role in OpenMP: a program without a parallel directive is executed sequentially.

#pragma omp parallel [clause, clause, ...]
{
  // block
}

When a thread reaches a parallel directive, it creates a team of threads and becomes the master of the team. The master is a member of that team and has thread number 0 within that team; the other threads have numbers 1...N-1.

The parallel directive. Starting from the beginning of the parallel region, the code is duplicated and all threads execute that code. There is an implied barrier at the end of a parallel region; only the master thread continues execution past this point. If any thread terminates within a parallel region, all threads in the team terminate, and the work done up until that point is undefined.

How many threads? The number of threads in a parallel region is determined by the following factors, in order of precedence: evaluation of the if clause; setting of the num_threads clause; use of the omp_set_num_threads() library function; setting of the OMP_NUM_THREADS environment variable; the implementation default, e.g., the number of CPUs. When present, the if clause must evaluate to true (i.e., a nonzero integer) in order for a team of threads to be created; otherwise, the region is executed sequentially by the master thread.
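
A minimal sketch (not from the slides) showing this precedence in practice; the thread counts and the n > 10 condition are arbitrary choices for illustration.

#include <omp.h>
#include <cstdio>

int main() {
  int n = 100;
  omp_set_num_threads(4);   // overrides the OMP_NUM_THREADS environment variable
  // num_threads(2) overrides omp_set_num_threads(); the region is parallel only if n > 10
  #pragma omp parallel if(n > 10) num_threads(2)
  {
    printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}

With n = 100 the region runs with 2 threads; if n were 10 or less, it would run sequentially on the master thread.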

The OpenMP library. The OpenMP standard defines an API for a variety of library functions: query the number of threads/processors; set the number of threads to use; general purpose locking routines (semaphores); portable wall clock timing routines; set execution environment functions (nested parallelism, dynamic adjustment of threads). It may be necessary to specify the include file "omp.h".
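
A minimal sketch (not from the slides) exercising two of these categories, the locking routines and the wall clock timer; the shared counter is just a toy use of the lock.

#include <omp.h>
#include <cstdio>

int main() {
  omp_lock_t lock;
  omp_init_lock(&lock);                 // general purpose locking routine
  int count = 0;
  double start = omp_get_wtime();       // portable wall clock timer
  #pragma omp parallel num_threads(4)
  {
    omp_set_lock(&lock);                // only one thread at a time updates count
    count++;
    omp_unset_lock(&lock);
  }
  printf("count = %d, elapsed = %f s\n", count, omp_get_wtime() - start);
  omp_destroy_lock(&lock);
  return 0;
}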

The OpenMP library. void omp_set_num_threads(int num_threads): sets the number of threads that will be used in the next parallel region; num_threads must be a positive integer; this routine can only be called from the serial portions of the code. int omp_get_num_threads(void): returns the number of threads that are currently in the team executing the parallel region from which it is called. int omp_get_thread_num(void): returns the thread number of the thread, within the team, making this call; this number will be between 0 and omp_get_num_threads()-1; the master thread of the team is thread 0.

A better hello world

#include <omp.h>
#include <iostream>
#include <sstream>
using namespace std;

int main() {
  #pragma omp parallel num_threads(3)
  {
    stringstream msg;
    msg << "Hello World. I am thread " << omp_get_thread_num() << endl;
    cout << msg.str();
  }
}

$ ./hello
Hello World. I am thread 1
Hello World. I am thread 0
Hello World. I am thread 2

Variable scoping in OpenMP. OpenMP is based upon the shared memory programming model, so most variables are shared by default. The OpenMP Data Scope Attribute Clauses are used to explicitly define how variables should be scoped. Four main scope types: private/firstprivate, shared, default, reduction.

Variable scoping in OpenMP. Data scope attribute clauses are used in conjunction with several directives to define how and which data variables in the serial section of the program are transferred to the (duplicate) parallel sections, and to define which variables will be visible to all threads in the parallel sections and which variables will be privately allocated to each thread. Data scope attribute clauses are effective only within their lexical/static extent.

The private clause. Syntax: private(list). The private clause declares variables in its list to be private to each thread. Private variables behave as follows: a new object of the same type is declared once for each thread in the team; all references to the original object are replaced with references to the new object; variables should be assumed to be uninitialized for each thread.

Example: private clause. Variables i and a must be declared before!

#pragma omp parallel for private(i,a)
for (i=0; i<n; i++) {
  a = i+1;
  printf("Thread %d has a value of a = %d for i = %d\n", omp_get_thread_num(), a, i);
} /*-- End of parallel for --*/

The firstprivate clause. Syntax: firstprivate(list). Similar to the private clause, but each thread's copy is initialized with the value the variable had before entering the directive block.
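
A minimal sketch (not from the slides) contrasting firstprivate with private; the initial value 42 is arbitrary.

#include <omp.h>
#include <cstdio>

int main() {
  int a = 42;
  #pragma omp parallel firstprivate(a) num_threads(2)
  {
    // each thread gets its own copy of a, initialized to 42;
    // with private(a) the copy would start uninitialized instead
    a += omp_get_thread_num();
    printf("thread %d: a = %d\n", omp_get_thread_num(), a);
  }
  printf("after the region a is still %d\n", a);  // the original a is left untouched
  return 0;
}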

The shared clause. Syntax: shared(list). The shared clause declares variables in its list to be shared among all threads in the team. A shared variable exists in only one memory location and all threads can read or write that address. It is the programmer's responsibility to ensure that multiple threads properly access shared variables.

Example: shared clause

#pragma omp parallel for shared(a)
for (i=0; i<n; i++) {
  a[i] += i;
} /*-- End of parallel for --*/

The default clause. Syntax: default(shared | none). The default scope of all variables is either set to shared or none. The default clause allows the user to specify a default scope for all variables in the lexical extent of any parallel region. Specific variables can be exempted from the default using specific clauses (e.g., private, shared, etc.). Using none as a default requires that the programmer explicitly scope all variables.
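
A minimal sketch (not from the slides) of default(none): every variable referenced in the region must be scoped explicitly, or the compiler rejects the code.

#include <cstdio>

int main() {
  int n = 8, sum = 0;
  // with default(none), n must be listed explicitly; sum is scoped by the
  // reduction clause and the loop variable i is implicitly private
  #pragma omp parallel for default(none) shared(n) reduction(+:sum)
  for (int i = 0; i < n; i++)
    sum += i;
  printf("sum = %d\n", sum);   // prints 28
  return 0;
}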

The reduction clause. Syntax: reduction(operator: list). Performs a reduction on the variables that appear in its list; operator defines the reduction operation. A private copy of each variable is created for each thread. At the end of the parallel section, the reduction operator is applied to all private copies of the shared variable, and the final result is written to the global shared variable.

Work-Sharing Constructs. Divide the execution of the enclosed code region among the members of the team that encounter it. A work-sharing construct must be enclosed dynamically within a parallel region in order for the directive to execute in parallel. Work-sharing constructs do not launch new threads. There is no implied barrier upon entry to a work-sharing construct; however, there is an implied barrier at the end of a work-sharing construct. for: shares iterations of a loop across the team (data parallelism). sections: breaks work into separate, discrete sections, each executed by a thread (functional parallelism). single: serializes a section of code.

The for directive

#pragma omp for [clause ...]
for_loop

Three main clause types: schedule, nowait, and data scope. schedule(type [,chunk]) describes how iterations of the loop are divided among the threads in the team (the default schedule is implementation dependent). nowait, if specified, means that threads do not synchronize at the end of the parallel loop. Data scope clauses (private, firstprivate, shared, reduction) still apply.

The for directive (schedule types). static: loop iterations are divided into blocks of size chunk and then statically assigned to threads; if chunk is not specified, the iterations are evenly (if possible) divided contiguously among the threads. dynamic: loop iterations are divided into blocks of size chunk and dynamically scheduled among the threads; when a thread finishes one chunk, it is dynamically assigned another; the default chunk size is 1.
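
A minimal sketch (not from the slides) comparing the two schedule types; the chunk size of 4 and the loop bound of 16 are arbitrary.

#include <omp.h>
#include <cstdio>

int main() {
  const int N = 16;
  // static: chunks of 4 iterations are assigned to threads up front, round-robin
  #pragma omp parallel for schedule(static, 4)
  for (int i = 0; i < N; i++)
    printf("static:  thread %d got i = %d\n", omp_get_thread_num(), i);

  // dynamic: chunks of 4 iterations are handed out on demand as threads become free
  #pragma omp parallel for schedule(dynamic, 4)
  for (int i = 0; i < N; i++)
    printf("dynamic: thread %d got i = %d\n", omp_get_thread_num(), i);
  return 0;
}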

Example: Dot Product

#include <iostream>
#include <vector>
#include <algorithm>
#include <cstdlib>
using namespace std;

int main() {
  double sum;
  vector<double> a(10000000), b(10000000);
  generate(a.begin(), a.end(), drand48);
  generate(b.begin(), b.end(), drand48);
  sum = 0;
  for (int i=0; i<a.size(); ++i) {
    sum = sum + a[i]*b[i];
  }
  cout << sum << endl;
}

Example: Dot Product

#include <iostream>
#include <vector>
#include <algorithm>
#include <cstdlib>
using namespace std;

int main() {
  double sum;
  vector<double> a(10000000), b(10000000);
  generate(a.begin(), a.end(), drand48);
  generate(b.begin(), b.end(), drand48);
  sum = 0;
  #pragma omp parallel for reduction(+: sum)
  for (int i=0; i<a.size(); ++i) {
    sum = sum + a[i]*b[i];
  }
  cout << sum << endl;
}

Example: Dot Product

...
int chunksize = a.size()/omp_get_max_threads();
#pragma omp parallel for schedule(dynamic, chunksize) reduction(+: sum)
for (int i=0; i<a.size(); ++i) {
  sum = sum + a[i]*b[i];
}

The sections directive

#pragma omp sections [clause ...]   // clauses: private, reduction, nowait, etc.
{
  #pragma omp section
  {
    // execute A
  }
  #pragma omp section
  {
    // execute B
  }
}

Specifies that the enclosed section(s) of code are to be divided among the threads in the team. Each section is executed once by a thread in the team, and different sections may be executed by different threads. It is possible for a thread to execute more than one section.

Example: Dot product (vectors init)

#pragma omp sections nowait
{
  #pragma omp section
  {
    int ita;
    #pragma omp parallel for shared(a) private(ita)
    for (ita=0; ita<a.size(); ++ita) {
      a[ita] = drand48();
    }
  }
  #pragma omp section
  {
    int itb;
    #pragma omp parallel for shared(b) private(itb)
    for (itb=0; itb<b.size(); ++itb) {
      b[itb] = drand48();
    }
  }
}

The single directive. The single directive is associated with the structured block of code immediately following it and specifies that this block should be executed by one thread only. It does not state which thread should execute the code block; the thread chosen could vary from one run to another, and it can also differ for different single constructs within one application.

#pragma omp single [private | nowait | ...]
{
  ...
}

Example: single directive

#pragma omp parallel shared(a,b) private(i)
{
  #pragma omp single
  {
    a = 10;
    cout << "Single construct executed by thread " << omp_get_thread_num() << endl;
  }
  /* A barrier is automatically inserted here */
  #pragma omp for
  for (i=0; i<n; i++)
    b[i] = a;
} /*-- End of parallel region --*/

The critical directive

#pragma omp critical [name]
{
  // block
}

The critical directive specifies a region of code that must be executed by only one thread at a time. If a thread is currently executing inside a critical region and another thread reaches that critical region and attempts to execute it, it will block until the first thread exits that critical region. The optional name enables multiple different critical regions: names act as global identifiers, critical regions with the same name are treated as the same region, and all unnamed critical sections are treated as the same section.

Example: critical directive

sum = 0;
#pragma omp parallel shared(n,a,sum) private(tid,sumlocal)
{
  tid = omp_get_thread_num();
  sumlocal = 0;
  #pragma omp for
  for (int i=0; i<n; i++)
    sumlocal += a[i];
  #pragma omp critical (update_sum)
  {
    sum += sumlocal;
    cout << "TID=" << tid << " sumlocal=" << sumlocal << " sum=" << sum << endl;
  }
} /*-- End of parallel region --*/
cout << "Value of sum after parallel region: " << sum << endl;

The barrier directive

#pragma omp barrier

The barrier directive synchronizes all threads in the team. When a barrier is reached, a thread will wait at that point until all other threads have reached that barrier; all threads then resume executing in parallel the code that follows the barrier. All threads in a team (or none) must execute the barrier region.
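
A minimal sketch (not from the slides): no thread prints "phase 2" until every thread in the team has printed "phase 1".

#include <omp.h>
#include <cstdio>

int main() {
  #pragma omp parallel num_threads(4)
  {
    printf("thread %d: phase 1\n", omp_get_thread_num());
    #pragma omp barrier   // wait here until all 4 threads have finished phase 1
    printf("thread %d: phase 2\n", omp_get_thread_num());
  }
  return 0;
}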

When to Use OpenMP? The target platform is multicore or multiprocessor; the application is cross-platform; there are parallelizable loops; last-minute optimizations are needed.