Shared memory parallel computing


Shared memory parallel computing with OpenMP. Sean Stijven, Przemyslaw Klosiewicz

Shared-memory programming API for SMP machines. Introduced in 1997 by the OpenMP Architecture Review Board (Compaq/Digital, HP, Intel, IBM, KAI, Silicon Graphics, Sun, US DoE). More high-level than manual thread programming. Specified for C/C++ and Fortran, and widely supported by most compilers, except Clang :( We only look at C/C++. By the way: OpenMP & C++ is not the best combination ever!

Compiler support: OpenMP 2.5 in GCC 4.2; OpenMP 3.0 in GCC 4.4 and Intel 11.0; OpenMP 3.1 in GCC 4.7 and Intel 12.1; OpenMP 4.0 in GCC 4.9. Not yet in Clang/LLVM, unfortunately.
Official OpenMP specification docs: http://openmp.org/wp/openmp-specifications/
On GCC's implementation of OpenMP: http://gcc.gnu.org/wiki/openmp

The OpenMP fork-join model: the master thread forks a team of parallel threads, which join again at the end of the parallel region. The programmer interacts with OpenMP mostly through compiler directives (all directives start with #pragma omp); other API calls need #include <omp.h>.

- first example -

#ifndef _OPENMP
# error("the whole point of OpenMP examples is to use OpenMP")
#endif

#include <iostream>
#include <omp.h>

using namespace std;

int main(int argc, char* argv[]) {

#pragma omp parallel
  {
    cout << "Hello from thread " << omp_get_thread_num() << endl;
  }
  // end of #pragma omp parallel

  return 0;
}

Compiler flags: GCC: -fopenmp, Intel: -openmp

Output (thread order is nondeterministic):
Hello from thread 1
Hello from thread 0
Hello from thread 6
Hello from thread 3
Hello from thread 7
Hello from thread 2
Hello from thread 4
Hello from thread 5

- first example (cont.) - The same hello-world code, shown again on the slide to introduce the general form of directives:

#pragma omp <directive name> [clauses...] <newline>

OpenMP overview

#pragma omp <directive name> [clauses...] <newline>

- Directives & work-sharing constructs
- Synchronisation
- Clauses (= options) (especially data scope clauses)
- API calls & environment variables

(I'm roughly following https://computing.llnl.gov/tutorials/openmp/)

OpenMP overview, revisited. Up next: directives & work-sharing constructs.

parallel directive

#pragma omp parallel
{
  ...
}

Creates a team of threads (the master has id = 0). All threads execute the code in this block. Implicit join at the end of the block. If one thread terminates abnormally, all terminate. Usually MOST other OpenMP constructs should be inside such a block!

parallel directive

#pragma omp parallel
{
  ...
}

Number of threads determined by:
- clause: if (<boolean expression>)
- clause: num_threads(n)
- environment variable: OMP_NUM_THREADS
- default: determined by the runtime
omp_get_num_threads() returns the size of the active team.

parallel directive - example with the if clause (with the condition false, the region runs with a single thread):

#ifndef _OPENMP
# error("the whole point of OpenMP examples is to use OpenMP")
#endif

#include <iostream>
#include <omp.h>

using namespace std;

int main(int argc, char* argv[]) {
  bool do_stuff_in_parallel = false;
#pragma omp parallel if (do_stuff_in_parallel)
  {
    cout << "Hello from thread " << omp_get_thread_num() << endl;
  }
  // end of #pragma omp parallel

  return 0;
}
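
A minimal sketch (not on the original slides) combining the num_threads clause with omp_get_num_threads(); the thread count of 4 is just an example, and the runtime may still give fewer threads:

#include <iostream>
#include <omp.h>

using namespace std;

int main() {
  // Ask for a team of 4 threads for this one region.
#pragma omp parallel num_threads(4)
  {
    // Print the team size once, from the master thread.
    if (omp_get_thread_num() == 0)
      cout << "Team size: " << omp_get_num_threads() << endl;
  }
  return 0;
}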

- work sharing directives -
For loop: data parallelism, i.e. executing a for-loop over a data range in parallel.
Sections: functional parallelism, i.e. kind-of tasks run in parallel.
Single: restrict execution to one thread.

for directive

#pragma omp for
for (int i = 0; i < n; ++i) {
  ...
}

No endless loops & no premature breaks! No manual fiddling around with the loop index! STL iterators should in theory be allowed, but can be quirky to get working.

for directive. Remember: this goes inside a parallel block:

#pragma omp parallel
{
  #pragma omp for
  for (int i = 0; i < n; ++i) {
    result[i] = some_work(...);
  }
}

Scheduling: most probably static, but decided by the runtime. Otherwise:
#pragma omp for schedule (dynamic, <chunk size>)
#pragma omp for schedule (runtime)
#pragma omp for schedule (auto)

for directive. Shorthand notation:

#pragma omp parallel for
for (int i = 0; i < n; ++i) {
  result[i] = some_work(...);
}
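
A small sketch (not from the slides) of the schedule clause mentioned above: with dynamic scheduling, idle threads grab chunks of 8 iterations at a time, which helps when iterations have very different costs. The inner loop is just a stand-in for some_work(...):

#include <vector>
#include <cmath>
#include <iostream>

using namespace std;

int main() {
  const int n = 1000;
  vector<double> result(n);

  // Chunks of 8 iterations are handed out to threads as they become free.
#pragma omp parallel for schedule(dynamic, 8)
  for (int i = 0; i < n; ++i) {
    double x = 0.0;
    for (int k = 0; k < i; ++k)     // deliberately uneven amount of work per i
      x += sin(k * 0.001);
    result[i] = x;
  }

  cout << "result[n-1] = " << result[n - 1] << endl;
  return 0;
}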

sections directive

#pragma omp sections
{
  #pragma omp section
  {
    // executed by a thread
  }
  #pragma omp section
  {
    // executed by another thread
  }
}

Each section will be executed by one thread in the team (on the slide: one section on core 0, the other on core 1).

sections directive. Shorthand notation:

#pragma omp parallel sections
{
  #pragma omp section
  {
    ...
  }
  ...
}
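
For completeness, a runnable sketch (not from the slides) of the shorthand form: two independent pieces of work run as sections and, given at least two threads, can execute concurrently (the output lines may interleave):

#include <iostream>
#include <omp.h>

using namespace std;

int main() {
#pragma omp parallel sections
  {
#pragma omp section
    {
      cout << "section A on thread " << omp_get_thread_num() << endl;
    }
#pragma omp section
    {
      cout << "section B on thread " << omp_get_thread_num() << endl;
    }
  }
  return 0;
}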

single directive

#pragma omp single
{
  // Executed by one thread
}

You really don't know which thread will execute this section. Useful for I/O, timing, ...

task directive (new in OpenMP 3.0)

#pragma omp task
{
  ...
}

Explicitly creates a task that will be scheduled now... or later. Similar to sections, but allows nesting, recursion and (since OpenMP 4.0) dependences on other tasks!
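
A sketch of tasks in the recursive setting the slide alludes to (not from the original slides, and deliberately naive): each call spawns two child tasks and waits for them with taskwait, which is introduced a few slides further on.

#include <iostream>
#include <omp.h>

using namespace std;

long fib(int n) {
  if (n < 2) return n;
  long a, b;
#pragma omp task shared(a)    // child task computing fib(n-1)
  a = fib(n - 1);
#pragma omp task shared(b)    // child task computing fib(n-2)
  b = fib(n - 2);
#pragma omp taskwait          // wait for both children before summing
  return a + b;
}

int main() {
  long result = 0;
#pragma omp parallel
  {
#pragma omp single            // one thread builds the task tree, the whole team executes it
    result = fib(20);
  }
  cout << "fib(20) = " << result << endl;   // 6765
  return 0;
}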

OpenMP overview, revisited. Up next: synchronisation.

master directive

#pragma omp master
{
  // executed by the thread with id = 0
}

Similar to single, but this time you know which thread will execute the block.

critical directive

#pragma omp critical [name]
{
  // executed by one thread at a time
}

Defines a critical section. You can use names to distinguish between different critical sections; unnamed ones are treated as if they all had the same name.

atomic directive

#pragma omp atomic
<statement>

A minimal critical section of just one statement. Can often be optimized by the compiler to be faster than a locking critical section! <statement> uses a scalar lvalue x and can be:
++x, --x, x++, x--
x <op>= expr   (op is one of +, -, *, /, ^, &, |, <<, >>; expr does not contain x)
(evaluation of expr is NOT atomic, only the load/store of x is)
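
A small comparison sketch (not from the slides): both loops increment a shared counter; critical uses a general lock, atomic a single hardware-level update, so atomic is usually the cheaper of the two here.

#include <iostream>
#include <omp.h>

using namespace std;

int main() {
  const int n = 100000;
  int count_critical = 0;
  int count_atomic = 0;

#pragma omp parallel for
  for (int i = 0; i < n; ++i) {
#pragma omp critical
    {
      ++count_critical;     // one thread at a time in this block
    }
  }

#pragma omp parallel for
  for (int i = 0; i < n; ++i) {
#pragma omp atomic
    ++count_atomic;         // atomic update of a single scalar
  }

  // Both counters end up at n; without critical/atomic they very likely would not.
  cout << count_critical << " " << count_atomic << endl;
  return 0;
}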

barrier directive

#pragma omp barrier

Synchronises all threads in a team (i.e. a join, without terminating the threads).

taskwait (new in 3.0) and taskgroup (new in 4.0) directives

#pragma omp taskwait
#pragma omp taskgroup

Join for tasks: the current task suspends until its direct child tasks complete. taskgroup waits for all descendant tasks.

flush directive

#pragma omp flush [(<variables,...>)]

Makes sure the variable(s) are properly flushed to memory and are coherent between the threads! This is actually pretty important; fortunately it is implied for:
barrier
parallel - upon entry and exit
critical - upon entry and exit
ordered - upon entry and exit
for - upon exit
sections - upon exit
single - upon exit
... unless nowait was specified!

ordered directive

#pragma omp ordered
{
  ...
}

When inside a parallel loop (with an ordered clause!), this block will be executed in sequential order, while other parts of the loop can run in parallel.

ordered directive - example:

#include <iostream>
using namespace std;

int main(int argc, char* argv[]) {
#pragma omp parallel
  {
#pragma omp for ordered
    for (int i = 0; i < 4; ++i) {
      cout << "i = " << i << endl;
#pragma omp ordered
      cout << "(ordered) i = " << i << endl;
    }
  }

  return 0;
}

Output (the unordered lines may appear in any order):
i = 1
i = 0
i = 2
i = 3
(ordered) i = 0
(ordered) i = 1
(ordered) i = 2
(ordered) i = 3

OpenMP overview, revisited. Up next: clauses (especially data scope clauses).

nowait clause

#pragma omp parallel
{
  #pragma omp for nowait
  for (...) {
    ...
  }
  // no implicit barrier here!
} // implicit barrier

Also available for sections and single.

shared / private variables. Data scope clauses define how information is passed / shared between threads.

int a = 1;
int b = 2;
#pragma omp parallel shared(a) private(b)
{
  // a is 1 in all threads and refers to
  // the same place in memory!
  // b is private in each thread;
  // its value is NOT copied,
  // instead it is uninitialized!
}

shared / private variables

int a = 1;
int b = 2;
#pragma omp parallel shared(a) firstprivate(b)
{
  // a is 1 in all threads and refers to
  // the same place in memory!
  // b is private in each thread
  // and its original value IS copied!
}

shared / private variables. By default, all variables are shared (except for the loop index!).

int a = 1;
int b = 2;
#pragma omp parallel default(none)
{
  // Error: setting default to none
  // forces explicit definition of the scoping
  // of every variable referenced in the block
}
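
To make the three clauses concrete, a sketch (not from the slides) that uses all of them in one region; the per-thread copies are stored in a small shared array so the region itself needs no I/O.

#include <iostream>
#include <omp.h>

using namespace std;

int main() {
  int a = 1, b = 2, c = 3;
  int copies_of_c[2] = {0, 0};

#pragma omp parallel num_threads(2) default(none) \
        shared(a, copies_of_c) private(b) firstprivate(c)
  {
    int id = omp_get_thread_num();
    b = id;                 // b is uninitialized here, so assign before reading it
    c += a + b;             // c started at 3 (copied in), a is the shared original
    if (id < 2)
      copies_of_c[id] = c;  // e.g. thread 0 stores 4, thread 1 stores 5
  }

  // The originals of b and c are untouched by the private copies.
  cout << "b = " << b << ", c = " << c
       << ", thread copies of c: " << copies_of_c[0] << ", " << copies_of_c[1] << endl;
  return 0;
}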

reduction clause. Reduction is an important concept in parallel computing: combine n values from many threads into 1 value, e.g. a vector norm, the sum of the elements in an array, etc. The clause reduction (<op>:<variables>) defines reduction variables. <op> is one of: +, -, *, &, |, ^, &&, ||

Note: != is not among the defined reduction operators.
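
A minimal sketch (not from the slides) of the reduction clause: every thread accumulates a private partial sum, and OpenMP combines the partial sums into sum when the loop ends.

#include <iostream>
#include <omp.h>

using namespace std;

int main() {
  const int n = 1000;
  double v[n];
  for (int i = 0; i < n; ++i)
    v[i] = 1.0;

  double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < n; ++i)
    sum += v[i];                      // each thread works on its own private sum

  cout << "sum = " << sum << endl;    // prints 1000
  return 0;
}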

OpenMP overview, revisited. Up next: API calls & environment variables.

- API calls -
int omp_get_thread_num() - ID of the executing thread
int omp_get_num_threads() - number of threads in the team
double omp_get_wtime() - number of seconds since some point in the past (use it to calculate time differences)
int omp_get_max_threads() - max number of threads in a team
... and many more: lock variables etc.
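
A sketch (not from the slides) of the typical omp_get_wtime() pattern: only the difference between two calls is meaningful, so take one timestamp before and one after the region you want to measure.

#include <iostream>
#include <omp.h>

using namespace std;

int main() {
  const long n = 50000000L;

  double t0 = omp_get_wtime();

  double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
  for (long i = 1; i <= n; ++i)
    sum += 1.0 / i;                   // some measurable amount of work

  double t1 = omp_get_wtime();

  cout << "max threads: " << omp_get_max_threads()
       << ", elapsed: " << (t1 - t0) << " s"
       << ", sum = " << sum << endl;
  return 0;
}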

- env. variables -
OMP_NUM_THREADS - number of threads OpenMP will use by default. Quite convenient: $ OMP_NUM_THREADS=2 ./myawesomeprogram
OMP_DYNAMIC - allow the runtime to dynamically adjust the number of threads
OMP_NESTED - allow nested parallelism, see the docs
... and many more, some platform / compiler bound.

Real world scenario: parallelize someone else's sh*tty code. The plan:
- Find (crappy) code for the Mandelbrot fractal
- Try to parallelize it with OpenMP (and make sure it still works as intended!!!)
- Measure the speedup (or the lack thereof!)
Btw.: a good explanation of the Mandelbrot set: http://warp.povusers.org/mandelbrot/

- case study - Original code: C-ish C++. Global variables. All of them! Two big loops that look parallelizable! At least it shows this: (rendered Mandelbrot image on the slide)

- case study - First attempt: 2x #pragma omp parallel for. Result: segmentation fault.

- case study - Second attempt: 2x #pragma omp parallel for private(j), with j the second loop variable. No crash, but not exactly correct...

- case study - Actually working solution:
#pragma omp parallel for private(x, y, x1, y1, x2, y2, j, k) before the first loop
#pragma omp parallel for private(j, c) before the second loop
Proof: (output image matching the serial version, shown on the slide)

- case study - Amdahl's law ("strong scaling", Gene M. Amdahl). Now, let's estimate the speedup! Place omp_get_wtime() calls to measure: the execution time of the whole program, and the execution time of the loops we want to parallelize. Domain: 4000 x 4000 pixels. In serial execution, ~55.5% of the time is spent in the loops that can be parallelized => the expected speedup is ~2x, at most! Now let's measure the actual speedup...
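
As a rough check of the "at most ~2x" claim, here is the Amdahl's law arithmetic, assuming the measured parallelizable fraction p = 0.555 and N threads (the numbers are only as good as the timing above):

S(N) = 1 / ((1 - p) + p / N)
S_max = 1 / (1 - p) = 1 / 0.445 ≈ 2.25   (limit for N -> infinity)
S(4) = 1 / (0.445 + 0.555 / 4) ≈ 1.71    (for example, with 4 threads)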

- case study - Do this analysis when you parallelize programs!

- implementation details - Maybe you remember from POSIX threads: thread creation is pretty expensive. How does (GNU) OpenMP handle that? Any tricks to improve performance?

- implementation details - Compile this with g++ (no optimizations, with debug info!):

#pragma omp parallel
{
  cout << "whatever";
}

Disassemble: objdump -dgsc my_binary > my_source.asm
Look at the loaded libraries: ldd my_binary
--- snip ---
libgomp.so.1 => /usr/lib/libgomp.so.1 (0x00007feb8f637000)
--- snap ---
libgomp is the OpenMP runtime library, GNU implementation.

- implementation details - Look at the disassembled code of your program: (assembly listing on the slide, showing calls into the OpenMP runtime)

- implementation details - Remember libgomp.so: it's part of GNU GCC! The symbols GOMP_parallel_start/end are defined there (check with nm). Get the source code of GCC from http://gcc.gnu.org/gcc-4.6/ (the right file is gcc-core-4.6.3.tar.gz). Look at file libgomp/parallel.c:104: void GOMP_parallel_start (void (*fn) (void *), void *data, unsigned num_threads). Look at file libgomp/team.c:251: void gomp_team_start (...). GNU OpenMP uses a pool of reusable POSIX threads!

Excerpt from libgomp/team.c:

250 /* Launch a team.  */
251
252 void
253 gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
254                  struct gomp_team *team)
255 {
256   struct gomp_thread_start_data *start_data;
257   struct gomp_thread *thr, *nthr;
258   struct gomp_task *task;
259   struct gomp_task_icv *icv;
260   bool nested;
261   struct gomp_thread_pool *pool;
262   unsigned i, n, old_threads_used = 0;
263   pthread_attr_t thread_attr, *attr;
264   unsigned long nthreads_var;

...

404   /* Launch new threads.  */
405   for (; i < nthreads; ++i, ++start_data)
406     {
407       pthread_t pt;
408       int err;
409
410       start_data->fn = fn;
411       start_data->fn_data = data;
412       start_data->ts.team = team;
413       start_data->ts.work_share = &team->work_shares[0];
414       start_data->ts.last_work_share = NULL;
415       start_data->ts.team_id = i;
...
428       if (gomp_cpu_affinity != NULL)
429         gomp_init_thread_affinity (attr);
430
431       err = pthread_create (&pt, attr, gomp_thread_start, start_data);
432       if (err != 0)
433         gomp_fatal ("Thread creation failed: %s", strerror (err));
434     }

- implementation details - According to http://gcc.gnu.org/onlinedocs/libgomp.pdf,

#pragma omp parallel
{
  body;
}

... becomes ...

void subfunction (void* data)
{
  body;
}

setup data;
GOMP_parallel_start (subfunction, &data, num_threads);
subfunction (&data);
GOMP_parallel_end ();

- assignment - Read "32 OpenMP Traps For C++ Developers" (http://www.viva64.com/en/a/0054/) and the other documents I will put on Blackboard / the site. Experiment with small toy programs. Try to parallelize small existing codes.

OpenMP 4.0 - the future is now! - Offloading code to GPUs & accelerators such as the Xeon Phi. SIMD / vectorization support. User-defined reductions. Error handling, thread affinity, task dependencies, ... Killer feature: Fortran 2003 support!