OPENMP TIPS, TRICKS AND GOTCHAS


OPENMP TIPS, TRICKS AND GOTCHAS
Mark Bull, EPCC, University of Edinburgh (and OpenMP ARB)
markb@epcc.ed.ac.uk
OpenMPCon 2015

Slide 2: A bit of background
I've been teaching OpenMP for over 15 years to students and scientists. I was chair of the OpenMP Language Committee for about 5 years, and I answer questions on the OpenMP Forum (openmp.org/forum). This is a collection of accumulated wisdom, in particular things you won't find in most tutorials... I hope you might find some of it helpful!

Slide 3: Directives
Mistyping the sentinel (e.g. !omp or #pragma opm) typically raises no error message. Be careful! It is extra nasty if it is e.g. #pragma opm atomic: a race condition! Write a script to search your code for your common typos.

Slide 4: Writing code that works without OpenMP too
The macro _OPENMP is defined if code is compiled with the OpenMP switch. You can use this to conditionally compile code so that it works both with and without OpenMP enabled. If you want to link dummy OpenMP library routines into sequential code, there is code in the standard you can copy (Appendix A in 4.0).

Slide 5: Parallel regions
The overhead of executing a parallel region is typically in the tens of microseconds range; it depends on the compiler, hardware and number of threads. The sequential execution time of a section of code has to be several times this to make parallelising it worthwhile. If a code section is only sometimes long enough, use the if clause to decide at runtime whether to go parallel or not: the overhead on one thread is typically much smaller (<1μs). You can use the EPCC OpenMP microbenchmarks to do detailed measurements of overheads on your system. Download from www.epcc.ed.ac.uk/research/computing/performance-characterisation-and-benchmarking

Slide 6: Is my loop parallelisable?
A quick and dirty test for whether the iterations of a loop are independent: run the loop in reverse order!! This is not infallible, but counterexamples are quite hard to construct.

Slide 7: Loops and nowait

#pragma omp parallel
{
  #pragma omp for schedule(static) nowait
  for (i=0; i<n; i++) {
    a[i] = ...
  }
  #pragma omp for schedule(static)
  for (i=0; i<n; i++) {
    ... = a[i]
  }
}

This is safe so long as the number of iterations in the two loops and the schedules are the same (the schedule must be static, but you can specify a chunksize): you are guaranteed to get the same mapping of iterations to threads.

Slide 8: Default schedule
Note that the default schedule for loops with no schedule clause is implementation defined. It doesn't have to be STATIC. In practice, in all implementations I know of, it is; nevertheless, you should not rely on this! Also note that SCHEDULE(STATIC) does not completely specify the distribution of loop iterations, so don't write code that relies on a particular mapping of iterations to threads.

Slide 9: Tuning the chunksize
Tuning the chunksize for static or dynamic schedules can be tricky, because the optimal chunksize can depend quite strongly on the number of threads. It's often more robust to tune the number of chunks per thread and derive the chunksize from that. The chunksize expression does not have to be a compile-time constant.

Slide 10: SINGLE or MASTER?
Both constructs cause a code block to be executed by one thread only, while the others skip it: which should you use? MASTER has lower overhead (it's just a test, whereas SINGLE requires some synchronisation). But beware that MASTER has no implied barrier! If you expect some threads to arrive before others, use SINGLE; otherwise use MASTER.

Slide 11: Fortran 90 array syntax
You can't use loop directives directly to parallelise Fortran 90 array syntax. WORKSHARE is a worksharing directive (!) which allows parallelisation of Fortran 90 array operations, and of WHERE and FORALL constructs. In simple cases it avoids having to expose a loop. Performance portability may be an issue, however! Syntax:

!$OMP WORKSHARE
  block
!$OMP END WORKSHARE [NOWAIT]

Slide 12: Data sharing attributes
Don't forget that private variables are uninitialised on entry to parallel regions! You can use firstprivate, but needing it is more likely to indicate an error: genuine use cases for firstprivate are surprisingly rare.

Slide 13: default(none)
The default behaviour for parallel regions and worksharing constructs is default(shared). This is extremely dangerous: it makes it far too easy to accidentally share variables. Possibly the worst design decision in the history of OpenMP! Always, always use default(none). I mean always. No exceptions! Everybody suffers from variable blindness.

Slide 14: Spot the bug!

#pragma omp parallel for private(temp)
for (i=0; i<n; i++) {
  for (j=0; j<m; j++) {
    temp = b[i]*c[j];
    a[i][j] = temp * temp + d[i];
  }
}

May always get the right result with sufficient compiler optimisation! (The bug: the inner loop index j is shared by default, so there is a race on it; it should be private.)

Slide 15: Private global variables

double foo;
#pragma omp parallel \
        private(foo)
{
  foo = ...
  a = somefunc();
}

extern double foo;
double somefunc(void) {
  ... = foo;
}

It is unspecified whether the reference to foo in somefunc is to the original storage or to the private copy. Unportable and therefore unusable! If you want access to the private copy, pass it through the argument list (or use threadprivate).

Slide 16: Huge long loops
What should I do in this situation? (typical old-fashioned Fortran style)

do i=1,n
  ... several pages of code referencing 100+ variables
end do

Determining the correct scope (private/shared/reduction) for all those variables is tedious, error prone and difficult to test adequately.

Slide 17: Refactor sequential code to

do i=1,n
  call loopbody(...)
end do

Make all loop temporary variables local to loopbody, and pass the rest through the argument list. This is much easier to test for correctness! Then parallelise... C/C++ programmers can declare temporaries in the scope of the loop body.

Slide 18: Reduction race trap

#pragma omp parallel shared(sum, b)
{
  sum = 0.0;
  #pragma omp for reduction(+:sum)
  for (i=0; i<n; i++) {
    sum += b[i];
  }
  ... = sum;
}

There is a race between the initialisation of sum and the updates to it at the end of the loop.

Slide 19: Missing SAVE or static
"Compiling my sequential code with the OpenMP flag caused it to break: what happened?" You may have a bug in your code which assumes that the contents of a local variable are preserved between function calls. Compiling with the OpenMP flag forces all local variables to be stack allocated rather than statically allocated, and might also cause stack overflow. You need to use SAVE or static correctly, but these variables are then shared by default, so you may need to make them threadprivate. "First time through" code may need refactoring (e.g. execute it before the parallel region).

Slide 20: Stack size
If you have large private data structures, it is possible to run out of stack space. The size of the thread stacks, apart from the master thread's, can be controlled by the OMP_STACKSIZE environment variable. The size of the master thread's stack is controlled in the same way as for a sequential program (e.g. compiler switch or using ulimit). OpenMP can't control this, as by the time the runtime is called it's too late!

Slide 21: Critical and atomic
You can't protect updates to a shared variable with atomic in one place and critical in another, if they might contend: there is no mutual exclusion between these constructs. critical protects code; atomic protects memory locations.

#pragma omp parallel
{
  #pragma omp critical
  a += 2;
  #pragma omp atomic
  a += 3;
}

Slide 22: Allocating storage based on the number of threads
Sometimes you want to allocate some storage whose size is determined by the number of threads, but how do you know how many threads the next parallel region will use? You can call omp_get_max_threads(), which returns the value of the nthreads-var ICV. The number of threads used for the next parallel region will not exceed this, except if a num_threads clause is used. Note that the implementation can always deliver fewer threads than this value: if your code depends on there actually being a certain number of threads, you should always call omp_get_num_threads() to check.

Slide 23: Using tasks
Getting the data attribute scoping right can be quite tricky: the default scoping rules are different from other constructs, and, as ever, using default(none) is a good idea. Don't use tasks for things already well supported by OpenMP, e.g. standard do/for loops: the overhead of using tasks is greater. Don't expect miracles from the runtime; the best results are usually obtained where the user controls the number and granularity of tasks.

Slide 24: Environment for performance
There are some environment variables you should set to maximise performance; don't rely on the defaults for these!

OMP_WAIT_POLICY=active
  Encourages idle threads to spin rather than sleep.
OMP_DYNAMIC=false
  Don't let the runtime deliver fewer threads than you asked for.
OMP_PROC_BIND=true
  Prevents threads migrating between cores.

Slide 25: Debugging tools
Traditional debuggers such as DDT or TotalView have support for OpenMP. This is good, but they are not much help for tracking down race conditions: the debugger changes the timing of events on different threads. Race detection tools work in a different way: they capture all the memory accesses during a run, then analyse this data for races which might have occurred. Examples: Intel Inspector XE, Oracle Solaris Studio Thread Analyzer.

Slide 26: Timers
Make sure your timer actually measures wall clock time! Do use omp_get_wtime()! Don't use clock(), for example: it measures CPU time accumulated across all threads, so no wonder you don't see any speedup...