Advanced OpenMP
Lecture 12: Tips, tricks and gotchas

Directives

Mistyping the sentinel (e.g. !omp or #pragma opm) typically raises no error message. Be careful!

The macro _OPENMP is defined if the code is compiled with the OpenMP switch. You can use this to conditionally compile code so that it works both with and without OpenMP enabled.

If you want to link dummy OpenMP library routines into sequential code, there is code in the standard you can copy (Appendix B).
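A minimal sketch of the conditional-compilation idea (the messages and program structure here are just illustrative, not from the original slides):

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
    #ifdef _OPENMP
    /* Compiled only when the OpenMP switch is used */
    printf("Running with OpenMP, up to %d threads\n", omp_get_max_threads());
    #else
    /* Sequential fallback */
    printf("Running without OpenMP\n");
    #endif
    return 0;
}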

Parallel regions

The overhead of executing a parallel region is typically in the 10-100 microsecond range; it depends on the compiler, the hardware and the number of threads. You can use the EPCC OpenMP microbenchmarks to make detailed measurements of overheads on your system; download from www.epcc.ed.ac.uk/research/computing/performance-characterisation-and-benchmarking

The sequential execution time of a section of code has to be several times this overhead to make it worthwhile parallelising. If a code section is only sometimes long enough, use the if clause to decide at runtime whether to go parallel or not (see the sketch below). The overhead on one thread is typically much smaller (<1 µs).

Is my loop parallelisable?

A quick and dirty test for whether the iterations of a loop are independent: run the loop in reverse order! This is not infallible, but counterexamples are quite hard to construct.
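As an illustrative sketch of the if clause (the function name, array and the threshold value 10000 are made up; you would tune the threshold for your own system):

void scale(double *a, int n)
{
    /* Only go parallel when n is large enough to amortise the
       parallel-region overhead; otherwise run sequentially. */
    #pragma omp parallel for if(n > 10000)
    for (int i = 0; i < n; i++) {
        a[i] *= 2.0;
    }
}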

Loops and nowait

#pragma omp parallel
{
   #pragma omp for schedule(static) nowait
   for (i=0; i<n; i++) {
      a[i] = ...
   }
   #pragma omp for schedule(static)
   for (i=0; i<n; i++) {
      ... = a[i]
   }
}

This is safe so long as the number of iterations in the two loops and the schedules are the same (the schedule must be static, but you can specify a chunksize). You are then guaranteed to get the same mapping of iterations to threads.

Default schedule

Note that the default schedule for loops with no schedule clause is implementation defined. It doesn't have to be STATIC. In practice it is in all implementations I know of; nevertheless you should not rely on this!

Tuning the chunksize

Tuning the chunksize for static or dynamic schedules can be tricky because the optimal chunksize can depend quite strongly on the number of threads. It is often more robust to tune the number of chunks per thread and derive the chunksize from that; the chunksize expression does not have to be a compile-time constant. A sketch of this is shown below.

SINGLE or MASTER?

Both constructs cause a code block to be executed by one thread only, while the other threads skip it: which should you use? MASTER has lower overhead (it is just a test, whereas SINGLE requires some synchronisation). But beware that MASTER has no implied barrier! If you expect some threads to arrive before others, use SINGLE.
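A minimal sketch of deriving the chunksize from a chunks-per-thread target (the function name, the array and the value 4 are illustrative assumptions):

#include <omp.h>

void process(double *a, int n)
{
    int chunks_per_thread = 4;   /* tuning parameter, illustrative value */
    int nthreads = omp_get_max_threads();
    int chunksize = n / (chunks_per_thread * nthreads);
    if (chunksize < 1) chunksize = 1;

    /* The chunksize in a schedule clause can be a runtime expression. */
    #pragma omp parallel for schedule(dynamic, chunksize)
    for (int i = 0; i < n; i++) {
        a[i] = a[i] * a[i];
    }
}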

Fortran 90 array syntax

You can't use loop directives directly to parallelise Fortran 90 array syntax. WORKSHARE is a worksharing directive (!) which allows parallelisation of Fortran 90 array operations, WHERE and FORALL constructs.

Syntax:

!$OMP WORKSHARE
   block
!$OMP END WORKSHARE [NOWAIT]

Workshare directive (cont.)

Simple example:

REAL A(100,200), B(100,200), C(100,200)
...
!$OMP PARALLEL
!$OMP WORKSHARE
A = B + C
!$OMP END WORKSHARE
!$OMP END PARALLEL

N.B. There is no schedule clause: the distribution of work units to threads is entirely up to the compiler! If the compiler doesn't do a good job, you may need to expose a loop explicitly. There is a synchronisation point at the end of the workshare: all threads must finish their work before any thread can proceed.

Workshare directive (cont.)

The block can also contain array intrinsic functions, WHERE and FORALL constructs, scalar assignment to shared variables, and ATOMIC and CRITICAL directives. No branches in or out of the block. No function calls except array intrinsics and those declared ELEMENTAL.

Combined directive:

!$OMP PARALLEL WORKSHARE
   block
!$OMP END PARALLEL WORKSHARE

Workshare directive (cont.)

Example:

!$OMP PARALLEL WORKSHARE
A = B + C
WHERE (D .ne. 0) E = 1/D
!$OMP ATOMIC
t = t + SUM(F)
FORALL (i=1:n, X(i)==0) X(i) = 1
!$OMP END PARALLEL WORKSHARE

Data sharing attributes

Don't forget that private variables are uninitialised on entry to parallel regions! You can use firstprivate, but such cases are more likely to be an error.

Always, always use default(none). I mean always. No exceptions! Everybody suffers from variable blindness.

Spot the bug!

#pragma omp parallel for shared(a,b,c,d,n,m) \
                         private(temp)
for (i=0; i<n; i++) {
   for (j=0; j<m; j++) {
      temp = b[i]*c[j];
      a[i][j] = temp * temp + d[i];
   }
}

You may always get the right result with sufficient compiler optimisation! (The bug and a corrected version are sketched below.)
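The bug is that the inner loop index j is not scoped, so it is shared by default and the threads race on it (i, as the index of the parallelised loop, is automatically private). A minimal corrected sketch (the function and array names are hypothetical), using default(none) so every variable has to be scoped explicitly:

/* Hypothetical helper for illustration: a has n rows of length m,
   b and d have length n, c has length m. */
void compute(int n, int m, double **a,
             const double *b, const double *c, const double *d)
{
    int i, j;
    double temp;

    /* i, the index of the parallelised loop, is private automatically.
       j and temp must be made private explicitly; default(none)
       forces every other variable to be scoped as well. */
    #pragma omp parallel for default(none) \
            shared(a, b, c, d, n, m) private(j, temp)
    for (i = 0; i < n; i++) {
        for (j = 0; j < m; j++) {
            temp = b[i] * c[j];
            a[i][j] = temp * temp + d[i];
        }
    }
}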

Huge long loops

What should I do in this situation?

do i=1,n
   ... several pages of code referencing 100+ variables
end do

Determining the correct scope (private/shared/reduction) for all those variables is tedious, error prone and difficult to test adequately. Refactor the sequential code to

do i=1,n
   call loopbody(...)
end do

Make all loop temporary variables local to loopbody and pass the rest through the argument list. This is much easier to test for correctness! Then parallelise (see the sketch below).
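A minimal C sketch of the same refactoring, with hypothetical names (loopbody, state, bigloop): all loop temporaries live inside loopbody, so they are automatically private, and parallelising the outer loop becomes trivial:

/* All temporaries are local to loopbody and therefore live on each
   thread's stack; shared state is passed explicitly. */
static void loopbody(int i, double *state)
{
    double t1 = state[i] * 2.0;
    double t2 = t1 + 1.0;
    state[i] = t1 * t2;
}

void bigloop(double *state, int n)
{
    /* After the refactor, parallelising is a one-line change. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        loopbody(i, state);
    }
}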

Reduction race trap

#pragma omp parallel shared(sum, b)
{
   sum = 0.0;
   #pragma omp for reduction(+:sum)
   for (i=0; i<n; i++) {
      sum += b[i];
   }
   ... = sum;
}

There is a race between the initialisation of sum and the updates to it at the end of the loop (a possible fix is sketched below).

Private global variables

double foo;
#pragma omp parallel \
        private(foo)
{
   foo = ...
   a = somefunc();
}

extern double foo;
double somefunc(void)
{
   ... = foo;
}

It is unspecified whether the reference to foo in somefunc is to the original storage or the private copy. Unportable and therefore unusable! If you want access to the private copy, pass it through the argument list.
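A minimal sketch of one way to avoid the race (one possible fix, not necessarily the only one): initialise sum before the parallel region, so no thread can start the reduction loop before the initialisation is complete:

#include <stdio.h>

int main(void)
{
    int n = 1000;
    double b[1000], sum;
    for (int i = 0; i < n; i++) b[i] = 1.0;

    sum = 0.0;                       /* initialise before the parallel region */
    #pragma omp parallel shared(sum, b, n)
    {
        #pragma omp for reduction(+:sum)
        for (int i = 0; i < n; i++) {
            sum += b[i];
        }
        /* Implicit barrier at the end of the for construct, so every
           thread sees the final value of sum here. */
        #pragma omp single
        printf("sum = %f\n", sum);
    }
    return 0;
}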

Missing SAVE or static

Compiling my sequential code with the OpenMP flag caused it to break: what happened? You may have a bug in your code which assumes that the contents of a local variable are preserved between function calls. Compiling with the OpenMP flag forces all local variables to be stack allocated rather than statically allocated, and might also cause stack overflow. You need to use SAVE or static correctly, but these variables are then shared by default, so you may need to make them threadprivate. First-time-through code may need refactoring (e.g. execute it before the parallel region).

Critical and atomic

You can't protect updates to shared variables in one place with atomic and in another with critical, if they might contend: there is no mutual exclusion between these constructs. critical protects code, atomic protects memory locations.

#pragma omp parallel
{
   #pragma omp critical
   a += 2;
   #pragma omp atomic
   a += 3;
}
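A minimal sketch of a safe version (one possible fix, not necessarily the only one): protect every contending update to the same variable with the same mechanism, here atomic in both places:

#include <stdio.h>

int main(void)
{
    int a = 0;

    #pragma omp parallel
    {
        /* Both updates use atomic, so they are mutually exclusive
           with respect to the memory location a. */
        #pragma omp atomic
        a += 2;
        #pragma omp atomic
        a += 3;
    }
    printf("a = %d\n", a);
    return 0;
}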

Allocating storage based on number of threads

Sometimes you want to allocate some storage whose size is determined by the number of threads, but how do you know how many threads the next parallel region will use? You can call omp_get_max_threads(), which returns the value of the nthreads-var ICV. The number of threads used for the next parallel region will not exceed this, except if a num_threads clause is used. Note that the implementation can always deliver fewer threads than this value: if your code depends on there actually being a certain number of threads, you should always call omp_get_num_threads() to check (see the sketch below).

Stack size

If you have large private data structures, it is possible to run out of stack space. The size of the thread stacks, apart from the master thread's, can be controlled by the OMP_STACKSIZE environment variable. The size of the master thread's stack is controlled in the same way as for a sequential program (e.g. using ulimit). OpenMP can't control this because by the time the runtime is called it's too late!
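A minimal sketch of this pattern (the array name and the work done are illustrative): size the array with omp_get_max_threads() before the region, then check omp_get_num_threads() inside it:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    /* Upper bound on the number of threads the next parallel
       region will use (absent a num_threads clause). */
    int maxthreads = omp_get_max_threads();
    double *partial = malloc(maxthreads * sizeof(double));

    #pragma omp parallel
    {
        /* The actual team size may be smaller than maxthreads. */
        int nthreads = omp_get_num_threads();
        int tid = omp_get_thread_num();
        partial[tid] = 1.0;

        #pragma omp single
        printf("using %d of up to %d threads\n", nthreads, maxthreads);
    }

    free(partial);
    return 0;
}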

Environment for performance

There are some environment variables you should set to maximise performance; don't rely on the defaults for these!

OMP_WAIT_POLICY=active   encourages idle threads to spin rather than sleep
OMP_DYNAMIC=false        don't let the runtime deliver fewer threads than you asked for
OMP_PROC_BIND=true       prevents threads migrating between cores

(A sketch of checking some of these settings from within the program is shown after this slide.)

Debugging tools

Traditional debuggers such as DDT or TotalView have support for OpenMP. This is good, but they are not much help for tracking down race conditions, because the debugger changes the timing of events on different threads. Race detection tools work in a different way: they capture all the memory accesses during a run, then analyse this data for races which might have occurred. Examples are Intel Inspector XE and Oracle Solaris Studio (the collect and discover tools, which also work on Linux).
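As an illustrative sketch (not from the original slides), some of these settings can also be checked or set from inside the program with the corresponding API routines, which can be handy for confirming that a batch script exported them correctly:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Equivalent of OMP_DYNAMIC=false: do not let the runtime
       deliver fewer threads than requested. */
    omp_set_dynamic(0);

    printf("dynamic adjustment: %s\n", omp_get_dynamic() ? "on" : "off");

    /* omp_get_proc_bind() reports the binding policy that applies
       to the next parallel region (OpenMP 4.0 and later). */
    printf("proc bind policy: %d\n", (int)omp_get_proc_bind());

    return 0;
}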

Profilers

Standard profilers (gprof, IDE profilers) can be confusing: they typically accumulate the time spent in functions across all threads. You can get a lot out of using timers (omp_get_wtime()). Add timers around every parallel region, and around the whole code, then work out which parallel regions have the worst speedup. Don't assume the time spent outside parallel regions is independent of the number of threads. A sketch is shown below.

Performance tools

Vampir/VampirTrace: timeline traces can be very useful for visualising load balance.
Intel VTune.
Scalasca: breaks down overheads into different categories.
Rogue Wave ThreadSpotter: a statistical memory profiler using tracing and simulation; very good for finding cache/memory problems, including false sharing.
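A minimal timing sketch using omp_get_wtime() (the loop body is just placeholder work):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    int n = 1000000;
    double *a = malloc(n * sizeof(double));

    double t_total = omp_get_wtime();    /* timer round the whole code */

    double t_region = omp_get_wtime();   /* timer round one parallel region */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        a[i] = (double)i * 0.5;          /* placeholder work */
    }
    t_region = omp_get_wtime() - t_region;

    t_total = omp_get_wtime() - t_total;
    printf("region: %g s, total: %g s\n", t_region, t_total);

    free(a);
    return 0;
}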