Shared memory parallel computing OpenMP Sean Stijven Przemyslaw Klosiewicz
Shared-mem. programming API for SMP machines. Introduced in 1997 by the OpenMP Architecture Review Board (Compaq/Digital, HP, Intel, IBM, KAI, Silicon Graphics, Sun, US DoE). More high-level than manual thread programming. C/C++ & Fortran. Widely supported by most compilers, except Clang :( We only cover C/C++. By the way: OpenMP & C++ is not the best combination ever!
Compiler support: OpenMP 2.5 in GCC 4.2; OpenMP 3.0 in GCC 4.4, Intel 11.0; OpenMP 3.1 in GCC 4.7, Intel 12.1; OpenMP 4.0 in GCC 4.9. Not yet in Clang / LLVM, unfortunately. Official OpenMP specification docs: http://openmp.org/wp/openmp-specifications/ On GCC's implementation of OpenMP: http://gcc.gnu.org/wiki/openmp
OpenMP fork-join model: a master thread forks a team of parallel threads, which join again at the end of the parallel region. The programmer interacts with OpenMP mostly through compiler directives (all directives start with #pragma omp; other API calls need #include <omp.h>).
- first example -

#ifndef _OPENMP
# error("the whole point of OpenMP examples is to use OpenMP")
#endif

#include <iostream>
#include <omp.h>

using namespace std;

int main(int argc, char* argv[]) {

#pragma omp parallel
  {
    cout << "Hello from thread " << omp_get_thread_num() << endl;
  }
  // end of #pragma omp parallel

  return 0;
}

Compiler flags: GCC: -fopenmp, Intel: -openmp

Output (the thread order is nondeterministic, e.g.):
Hello from thread 1
Hello from thread 0
Hello from thread 6
Hello from thread 3
Hello from thread 7
Hello from thread 2
Hello from thread 4
Hello from thread 5
- first example -

(same code as above)

General form of directives:
#pragma omp <directive name> [clauses...] <newline>
OpenMP #pragma omp <directive name> [clauses...] <newline> Directives & work-sharing constructs Synchronisation Clauses (= options) (especially data scope clauses) API calls & environment variables (I'm roughly following https://computing.llnl.gov/tutorials/openmp/)
parallel directive Creates a team of threads. (the master has id = 0) All threads execute code in this block Implicit join at end of block If one thread terminates abnormally, all terminate Usually MOST other OpenMP constructs should be inside this block! #pragma omp parallel {... }
parallel directive Number of threads determined by: clause: if (<boolean expression>) clause: num_threads(n)... #pragma omp parallel {... } environment variable: OMP_NUM_THREADS default: determined by the runtime omp_get_num_threads() returns the size of active team
parallel directive

#ifndef _OPENMP
# error("the whole point of OpenMP examples is to use OpenMP")
#endif

#include <iostream>
#include <omp.h>

using namespace std;

int main(int argc, char* argv[]) {
  bool do_stuff_in_parallel = false;
#pragma omp parallel if (do_stuff_in_parallel)
  {
    cout << "Hello from thread " << omp_get_thread_num() << endl;
  }
  // end of #pragma omp parallel

  return 0;
}
- work sharing directives - For loop: data parallelism, i.e. executing a for-loop over a data range in parallel. Sections: functional parallelism, i.e. different kinds of tasks run in parallel. Single: restrict execution to one thread.
for directive

#pragma omp for
for (int i = 0; i < n; ++i) {
  ...
}

No endless loops & no premature breaks! No manual fiddling with the loop index! STL iterators should in theory be allowed, but can be quirky to get working.
for directive Remember: inside this block #pragma omp parallel { #pragma omp for for (int i = 0; i < n; ++i) { result[i] = some_work(...); } } Scheduling: Most probably: static, but decided by the runtime Otherwise: #pragma omp for schedule (dynamic, <chunk size>) #pragma omp for schedule (runtime) #pragma omp for schedule (auto)
for directive Shorthand notation: #pragma omp parallel for for (int i = 0; i < n; ++i) { result[i] = some_work(...); }
sections directive

#pragma omp sections
{
  #pragma omp section
  {
    // executed by a thread
  }
  #pragma omp section
  {
    // executed by another thread
  }
}

(figure: the two sections running on Core 0 and Core 1)
Each section will be executed by one thread in the team.
sections directive Shorthand notation: #pragma omp parallel sections { #pragma omp section {... }... }
single directive #pragma omp single { // Executed by one thread } You really don't know which thread will execute this section. Useful for I/O, timing, ...
New in 3.0! OpenMP task directive #pragma omp task {... } Explicitly creates a task that will be scheduled now... or later. Similar to sections, but allows nesting, recursion and, since 4.0, dependences on other tasks!
OpenMP #pragma omp <directive name> [clauses...] <newline> Directives & work-sharing constructs Synchronisation Clauses (= options) (especially data scope clauses) API calls & environment variables
OpenMP #pragma omp master { // executed by thread with id = 0 } Similar to single, but this time you know which thread will execute
OpenMP #pragma omp critical [name] { // executed by one thread at a time } Defines a critical section. You can use names to distinguish between different critical sections; unnamed critical sections are all treated as if they had the same name.
OpenMP #pragma omp atomic <statement> A minimal critical section of just one statement. Can often be optimized by the compiler to be faster than a locking critical section! <statement> uses a scalar lvalue x and can be: ++x, --x, x++, x-- or x <op>= expr (op is one of +, -, *, /, ^, &, |, <<, >>; expr does not contain x; the evaluation of expr is NOT atomic, only the load/store of x is).
OpenMP #pragma omp barrier Synchronise all threads in a team (i.e.: join, without terminating)
OpenMP #pragma omp taskwait (new in 3.0) #pragma omp taskgroup (new in 4.0) Join for tasks: the current task suspends until its direct child tasks complete. taskgroup also waits for all descendant tasks.
OpenMP #pragma omp flush [(<variables,...>)] Makes sure the variable(s) are properly flushed to memory and are coherent between the threads! This is actually pretty important; fortunately it's implied for: barrier; parallel - upon entry and exit; critical - upon entry and exit; ordered - upon entry and exit; for - upon exit; sections - upon exit; single - upon exit... unless nowait was specified!
OpenMP #pragma omp ordered {... } When inside a parallel loop (with an ordered clause!), this block will be executed in sequential order while other parts of the loop can be run in parallel
OpenMP

#include <iostream>
using namespace std;

int main(int argc, char* argv[]) {
#pragma omp parallel
  {
#pragma omp for ordered
    for (int i = 0; i < 4; ++i) {
      cout << "i = " << i << endl;
#pragma omp ordered
      cout << "(ordered) i = " << i << endl;
    }
  }

  return 0;
}

Output (the unordered lines can appear in any order, e.g.):
i = 1
i = 0
i = 2
i = 3
(ordered) i = 0
(ordered) i = 1
(ordered) i = 2
(ordered) i = 3
OpenMP #pragma omp <directive name> [clauses...] <newline> Directives & work-sharing constructs Synchronisation Clauses (= options) (especially data scope clauses) API calls & environment variables
nowait clause #pragma omp parallel { #pragma omp for nowait for (...) {... } // no implicit barrier! } // implicit barrier Also for sections and single
shared / private variables Data scope clauses define how information is passed / shared between threads.

int a = 1;
int b = 2;
#pragma omp parallel shared(a) private(b)
{
  // a is 1 in all threads and refers to
  // the same place in memory!
  // b is private in each thread;
  // its value is NOT copied: it's uninitialized!
}
shared / private variables int a = 1; int b = 2; #pragma omp parallel shared(a) firstprivate(b) { // a is 1 in all threads and refers to // the same place in memory! // b is private in each thread // and its original value IS copied! }
shared / private variables By default: all are shared (except for the loop index!) int a = 1; int b = 2; #pragma omp parallel default(none) { // Error: setting default to none // forces explicit definition of scoping }
reduction Reduction is an important concept in parallel computing: combine n values from many threads into 1 value. E.g.: vector norm, sum of elements in an array, etc. The reduction(<op>:<variables>) clause defines reduction variables. <op> is one of: +, -, *, &, |, ^, &&, || (!= is not defined)
OpenMP #pragma omp <directive name> [clauses...] <newline> Directives & work-sharing constructs Synchronisation Clauses (= options) (especially data scope clauses) API calls & environment variables
- API calls -

int omp_get_thread_num()   ID of the executing thread
int omp_get_num_threads()  Number of threads in the team
double omp_get_wtime()     Seconds since some point in the past
                           (use differences to measure time)
int omp_get_max_threads()  Max number of threads in a team
... and many more: lock routines, etc.
- env. variables -

OMP_NUM_THREADS  Number of threads OpenMP will use by default
                 quite convenient: $ OMP_NUM_THREADS=2 ./myawesomeprogram
OMP_DYNAMIC      Allow dynamic adjustment of the number of threads
OMP_NESTED       Allow nested parallelism, see docs
... & many more, some platform / compiler bound
Real world scenario: Parallelize someone else's sh*tty code The plan: Find (crappy) code for the Mandelbrot fractal Try to parallelize it with OpenMP (and make sure it still works as intended!!!) Measure speedup (or the lack thereof!) Btw.: good explanation of the Mandelbrot set: http://warp.povusers.org/mandelbrot/
- case study - Original code: C-ish C++. Global variables. All of them! Two big loops that look parallelizable! At least it shows this: (figure: the rendered Mandelbrot set)
- case study - First attempt: 2x #pragma omp parallel for Segmentation fault
OpenMP - case study - Second attempt: 2x #pragma omp parallel for private(j) (j = second loop variable) Not exactly correct...
- case study - Actually working solution: #pragma omp parallel for private(x, y, x1, y1, x2, y2, j, k) <first loop> #pragma omp parallel for private(j, c) <second loop> "proof":
- case study - Gene M. Amdahl ("strong scaling") Now, let's estimate speedup! Place omp_get_wtime() calls to measure: execution time of the whole program; execution time of the loops we want to parallelize. Domain: 4000 x 4000 pixels, serial execution: ~55.5% of the time is spent in the parallelizable loops => expected speedup is at most 1/(1-0.555) ≈ 2.25x. Now let's measure the actual speedup...
- case study - Do this analysis when you parallelize programs!
- implementation details - Maybe you remember from POSIX threads: Thread creation is pretty expensive How does (GNU) OpenMP handle that? Any tricks to improve performance?
- implementation details -

Compile this with g++ (no optimizations, debug symbols):

#pragma omp parallel
{
  cout << "whatever";
}

Disassemble: objdump -dgsc my_binary > my_source.asm

Look at the loaded libraries: ldd my_binary
--- snip ---
libgomp.so.1 => /usr/lib/libgomp.so.1 (0x00007feb8f637000)
--- snap ---
OpenMP runtime lib., GNU implementation
- implementation details - Look at the disassembled code of your program: (figure: disassembly showing calls into GOMP_parallel_start / GOMP_parallel_end)
- implementation details - Remember libgomp.so: it's part of GNU GCC! The symbols GOMP_parallel_start/end are defined there (check with nm). Get the source code of gcc from: http://gcc.gnu.org/gcc-4.6/ (the right file is gcc-core-4.6.3.tar.gz). Look at file libgomp/parallel.c:104: void GOMP_parallel_start (void (*fn) (void *), void *data, unsigned num_threads) Look at file libgomp/team.c:251: void gomp_team_start (...) GNU OpenMP uses a pool of reusable POSIX threads!
libgomp/team.c:

250 /* Launch a team. */
251
252 void
253 gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
254                  struct gomp_team *team)
255 {
256   struct gomp_thread_start_data *start_data;
257   struct gomp_thread *thr, *nthr;
258   struct gomp_task *task;
259   struct gomp_task_icv *icv;
260   bool nested;
261   struct gomp_thread_pool *pool;
262   unsigned i, n, old_threads_used = 0;
263   pthread_attr_t thread_attr, *attr;
264   unsigned long nthreads_var;

404   /* Launch new threads. */
405   for (; i < nthreads; ++i, ++start_data)
406     {
407       pthread_t pt;
408       int err;
409
410       start_data->fn = fn;
411       start_data->fn_data = data;
412       start_data->ts.team = team;
413       start_data->ts.work_share = &team->work_shares[0];
414       start_data->ts.last_work_share = NULL;
415       start_data->ts.team_id = i;
...
428       if (gomp_cpu_affinity != NULL)
429         gomp_init_thread_affinity (attr);
430
431       err = pthread_create (&pt, attr, gomp_thread_start, start_data);
432       if (err != 0)
433         gomp_fatal ("Thread creation failed: %s", strerror (err));
434     }
OpenMP - implementation details -

According to http://gcc.gnu.org/onlinedocs/libgomp.pdf,

#pragma omp parallel
{
  body;
}

... becomes ...

void subfunction (void* data)
{
  body;
}

setup data;
GOMP_parallel_start (subfunction, &data, num_threads);
subfunction (&data);
GOMP_parallel_end ();
- assignment - Read: 32 OpenMP Traps For C++ Developers http://www.viva64.com/en/a/0054/ and other documents I will put on Blackboard / site Experiment with small toy programs Try to parallelize small existing codes
OpenMP 4.0 - the future is now! - Offloading code to GPUs & accelerators such as the Xeon Phi; SIMD / vectorization support; user-defined reductions; error handling, thread affinity, task dependencies, ... Killer feature: FORTRAN 2003 support!