
Introduction to OpenMP
Martin Čuma, Center for High Performance Computing, University of Utah
m.cuma@utah.edu

Overview
- Quick introduction
- Parallel loops
- Parallel loop directives
- Parallel sections
- Some more advanced directives
- Summary
Survey: https://www.surveymonkey.com/r/8hlj3wk

Shared memory
- All processors have access to local memory
- Simpler programming
- Concurrent memory access
- More specialized hardware
- CHPC: Linux clusters, 12-28 core nodes
[Figure: a dual quad-core node with CPUs sharing memory over a bus, and a many-core node (e.g. SGI) with multiple memory blocks]

OpenMP basics
- Directives, runtime library routines, environment variables
- Compiler directives to parallelize:
  Fortran - source code comments: !$omp parallel / !$omp end parallel
  C/C++ - #pragmas: #pragma omp parallel
- Small set of subroutines and environment variables, e.g.
  !$ iam = omp_get_num_threads()
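To make these pieces concrete, here is a minimal C hello-world sketch (not from the slides) combining a directive with two of the library routines; compile with the compiler's OpenMP flag (e.g. gcc -fopenmp) and set OMP_NUM_THREADS to choose the team size:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* fork a team of threads; the block runs once per thread */
        #pragma omp parallel
        {
            int iam = omp_get_thread_num();    /* this thread's id, 0..nth-1 */
            int nth = omp_get_num_threads();   /* size of the team */
            printf("thread %d of %d\n", iam, nth);
        }
        return 0;   /* threads join at the end of the parallel region */
    }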

Programming model
- Shared memory, thread-based parallelism
- Explicit parallelism
- Nested parallelism support
- Fork-join model
[Figure: the master thread forks a team of threads at each parallel region and joins them at its end (fork-join, fork-join)]

Example 1 - numerical integration
Trapezoidal rule with $n$ subintervals of width $h = (b-a)/n$:
$$\int_a^b f(x)\,dx \approx h\left[\frac{f(x_0) + f(x_n)}{2} + \sum_{i=1}^{n-1} f(x_i)\right], \qquad x_i = a + ih$$

Program code

    program trapezoid
    integer n, i
    double precision a, b, h, x, integ, f
! 1. serial setup on the master thread
    print*,"Input integ. interval, no. of trap:"
    read(*,*) a, b, n
    h = (b-a)/n
    integ = 0.
! 2. parallel region: fork, the loop is split among the threads, join
!$omp parallel do reduction(+:integ) private(x)
    do i = 1, n-1
       x = a + i*h
       integ = integ + f(x)
    enddo
! 3. back on the master thread
    integ = integ + (f(a)+f(b))/2.
    integ = integ*h
    print*,"Total integral = ", integ
    end

Program output

    em001:>% module load pgi
    em001:>% pgf77 -mp=numa trap.f -o trap
    em001:>% setenv OMP_NUM_THREADS 12
    em001:>% trap
    Input integ. interval, no. of trap:
    0 10 100
    Total integral = 333.3500000000001

Parallel do directive
- Fortran: !$omp parallel do [clause [, clause]] ... [!$omp end parallel do]
- C/C++: #pragma omp parallel for [clause [clause]]
- Loops must have a precisely determined trip count:
  - no do-while loops
  - no change to loop indices or bounds inside the loop (C)
  - no jumps out of the loop (Fortran: exit, goto; C: break, goto)
  - cycle (Fortran) and continue (C) are allowed
  - stop (Fortran) and exit (C) are allowed
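For illustration, a C loop that satisfies these rules (a sketch; the array names a and b are assumed): the trip count N is fixed at loop entry, and nothing in the body changes i or jumps out:

    #define N 1000
    double a[N], b[N];

    void scale(void) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            if (a[i] < 0.0) continue;   /* continue is allowed */
            b[i] = 2.0 * a[i];          /* but no break or goto out of the loop */
        }
    }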

Clauses
Control execution of the parallel loop:
- scope (shared, private) - sharing of variables among the threads
- if - whether to run in parallel or in serial
- schedule - distribution of work across the threads
- collapse(n) - combine nested loops into a single loop for more parallelism
- ordered - perform parts of the loop in sequential order
- copyin - initialize threadprivate variables on entry to the loop

Data sharing
- private - each thread creates a private instance; not initialized upon entry to the parallel region, undefined upon exit; default for loop indices and variables declared inside the parallel loop
- shared - all threads share one copy; an update modifies the data for all other threads; default for everything else
- Changing the default behavior: default(shared | private | none)
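A small C sketch (the names a, t, c are illustrative) of explicit scoping with default(none), which makes the compiler reject any variable whose sharing is not declared:

    #define N 100
    double a[N];

    void fill(double c) {
        double t;
        /* default(none): every variable used inside must be listed below */
        #pragma omp parallel for default(none) shared(a) private(t) firstprivate(c)
        for (int i = 0; i < N; i++) {   /* loop index: private by default */
            t = c * i;                  /* t: one copy per thread */
            a[i] = t;                   /* a: single copy shared by all */
        }
    }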

Data reduction
- Threads distribute the work, but the results must be collected at the end (sum up a total, find a minimum or maximum)
- Reduction clause - global operation on a variable:
  !$omp parallel do reduction(+:var)
  #pragma omp parallel for reduction(+:var)
- Allowed operations are commutative: +, *, max, min, logical
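As an example of a non-sum reduction, a C sketch of a max reduction (the max/min reduction operators in C/C++ require OpenMP 3.1 or later):

    #include <stdio.h>
    #define N 1000
    double a[N];

    int main(void) {
        for (int i = 0; i < N; i++) a[i] = (double)(i % 37);
        double amax = a[0];
        /* each thread keeps a private maximum; the private copies are
           combined into amax when the loop ends */
        #pragma omp parallel for reduction(max:amax)
        for (int i = 1; i < N; i++)
            if (a[i] > amax) amax = a[i];
        printf("max = %f\n", amax);   /* prints 36.0 */
        return 0;
    }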

Data dependence
Data in one loop iteration often depend on data written in another loop iteration.
- Anti-dependence - race between statement S1 writing and S2 reading, e.g.
    x = a(i)
    b(i) = c + x
  removal: privatization
- Output dependence - values from the last iteration are used outside the loop; removal: lastprivate clause
- Flow dependence - data at one iteration depend on data from another iteration, e.g.
    a(i) = a(i+1) + x
  removal: reduction, rearrangement, often impossible

Removing data dependencies
Serial trapezoidal rule:

    integ = 0.
    do i = 1, n-1
       x = a + i*h
       integ = integ + f(x)
    enddo

Parallel solution:

    integ = 0.
!$omp parallel do private(x) reduction(+:integ)
    do i = 1, n-1
       x = a + i*h
       integ = integ + f(x)
    enddo

- x - anti-dependence between threads; removed by privatization: private(x)
- integ - flow dependence; removed by reduction: reduction(+:integ)

Variable initialization
- firstprivate clause - initialization of a private variable: !$omp parallel do firstprivate(x)
- lastprivate clause - finalization of a private variable (keeps the value from the sequentially last iteration): !$omp parallel do lastprivate(x)
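A C sketch of both clauses together (the variables c and x are illustrative): c is read-only in the loop, so firstprivate gives each thread an initialized private copy; x is written every iteration, and lastprivate copies the value from the sequentially last iteration back out:

    #include <stdio.h>
    #define N 100
    double a[N];

    int main(void) {
        double c = 2.0;   /* copied into each thread's private c */
        double x;         /* private in the loop; final value copied out */
        #pragma omp parallel for firstprivate(c) lastprivate(x)
        for (int i = 0; i < N; i++) {
            x = c * i;
            a[i] = x;
        }
        printf("x = %f\n", x);   /* value from i = N-1: 2.0*99 = 198.0 */
        return 0;
    }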

Parallel overhead
- Parallelization costs CPU time
- For nested loops, parallelize the outermost loop
- if clause - parallelize only when it is worth it, e.g. above a certain number of iterations:

!$omp parallel do if (n .ge. 800)
    do i = 1, n
       ...
    enddo

Load balancing - scheduling
- User-defined work distribution: schedule(type[, chunk])
- chunk - number of iterations contiguously assigned to threads
- type:
  - static - each thread gets a constant chunk
  - dynamic - work distribution to threads varies
  - guided - chunk size exponentially decreases
  - runtime - schedule decided at run time (via OMP_SCHEDULE)
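For instance (a sketch; expensive() is a made-up stand-in for uneven per-iteration work), dynamic scheduling with a modest chunk keeps threads busy when iteration costs vary, while schedule(runtime) would defer the choice to the OMP_SCHEDULE environment variable:

    #include <math.h>
    #define N 10000
    double w[N];

    /* hypothetical workload whose cost grows with i, so an even static
       split would leave the low-i threads idle */
    double expensive(int i) {
        double s = 0.0;
        for (int k = 0; k < i; k++) s += sin((double)k);
        return s;
    }

    void run(void) {
        /* hand out chunks of 16 iterations to threads as they finish */
        #pragma omp parallel for schedule(dynamic, 16)
        for (int i = 0; i < N; i++)
            w[i] = expensive(i);
    }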

Static schedule timings
[Figure: loop time (sec) vs. chunk size for 2, 4, and 8 threads on an SGI Origin 2000; the default chunk is Niter/Nproc]

Different schedule timings
[Figure: loop time (sec) vs. chunk size for dynamic, guided, and static schedules on an SGI Origin 2000 with OMP_NUM_THREADS = 8]

Example 2 - MPI-like parallelization

    #include <stdio.h>
    #include <omp.h>
    #define min(a,b) ((a) < (b) ? (a) : (b))

    /* istart, iend - global variables, one private copy per thread */
    int istart, iend;
    #pragma omp threadprivate(istart, iend)

    int main (int argc, char* argv[]) {
      int n, nthreads, iam, chunk;
      float a, b;
      double h, integ, p_integ;
      double f(double x);                    /* local functions */
      double get_integ(double a, double h);

      /* 1.-2. serial setup */
      printf("Input integ. interval, no. of trap:\n");
      scanf("%f %f %d", &a, &b, &n);
      h = (b-a)/n;
      integ = 0.;

Example 2, cont.

      /* parallel section with explicit computation distribution */
      #pragma omp parallel shared(integ) private(p_integ,nthreads,iam,chunk)
      {
        nthreads = omp_get_num_threads();
        iam = omp_get_thread_num();
        chunk = (n + nthreads - 1)/nthreads;
        istart = iam*chunk + 1;               /* threadprivate globals */
        iend = min((iam+1)*chunk + 1, n);
        p_integ = get_integ(a, h);            /* function call that uses the globals */
        /* explicit reduction via mutual exclusion; atomic is faster than
           critical but only works on a single operation */
        #pragma omp atomic
        integ += p_integ;
      }
      integ += (f(a)+f(b))/2.;
      integ *= h;
      printf("Total integral = %f\n", integ);
      return 0;
    }

Example 2, cont.

    /* istart, iend - threadprivate global variables */
    double get_integ(double a, double h)
    {
      int i;
      double sum, x;
      sum = 0;
      for (i = istart; i < iend; i++) {
        x = a + i*h;
        sum += f(x);
      }
      return sum;
    }

Parallel regions
- Fortran: !$omp parallel ... !$omp end parallel
- C/C++: #pragma omp parallel
- SPMD parallelism - replicated execution
- Must be a self-contained block of code: one entry, one exit
- Implicit barrier at the end of the parallel region
- Can use the same clauses as parallel do/for

Work-sharing constructs
- DO/for loop - distributes a loop: do/for directive
- Sections - breaks work into separate, discrete sections: section directive
- Workshare - parallel execution of separate units of work: workshare directive (Fortran)
- Single/master - serialized section of code: single directive
[Figure: fork-join diagrams for the DO/for loop, sections, and single constructs]
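A short C sketch of the sections and single constructs (work_a and work_b are illustrative placeholders):

    #include <stdio.h>
    #include <omp.h>

    void work_a(void) { printf("A on thread %d\n", omp_get_thread_num()); }
    void work_b(void) { printf("B on thread %d\n", omp_get_thread_num()); }

    int main(void) {
        #pragma omp parallel
        {
            /* discrete units of work, each section taken by one thread */
            #pragma omp sections
            {
                #pragma omp section
                work_a();
                #pragma omp section
                work_b();
            }   /* implicit barrier at the end of sections */

            #pragma omp single
            printf("single: executed by one thread only\n");
        }
        return 0;
    }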

Work-sharing, cont.
Restrictions:
- contiguous block; no nesting
- all threads must reach the same construct
- constructs can be outside the lexical scope of the parallel construct (e.g. in a subroutine)

threadprivate variables
- Global/common block variables are private only in the lexical scope of the parallel region
- Possible solutions:
  - pass private variables as function arguments
  - use threadprivate - identifies a common block/global variable as private:
    !$omp threadprivate (/cb/ [,/cb/] ...)
    #pragma omp threadprivate (list)
  - use the copyin clause to initialize the threadprivate variables, e.g.
    !$omp parallel copyin(istart,iend)
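In C, the same pattern as in Example 2 can be written as a small sketch; copyin fills every thread's copy of the threadprivate globals from the master's values:

    int istart, iend;                      /* one copy per thread */
    #pragma omp threadprivate(istart, iend)

    void setup(void) {
        istart = 1; iend = 100;            /* set on the master thread */
        /* copyin: broadcast the master's istart/iend to all threads */
        #pragma omp parallel copyin(istart, iend)
        {
            /* every thread now starts from the master's values */
        }
    }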

Mutual exclusion
- critical section - limits access to a part of the code to one thread at a time:
  !$omp critical [name]
  ...
  !$omp end critical [name]
- atomic section - atomically updates a single memory location, e.g. sum += x
- Also available via runtime library functions
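A C sketch contrasting the two (sum, maxx, and hist are illustrative): atomic protects a single update cheaply, while critical serializes a whole block that updates several related locations:

    #include <stdio.h>
    #define N 100000

    int main(void) {
        double sum = 0.0, maxx = 0.0;
        int hist[10] = {0};
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            double x = (i % 100) / 100.0;
            #pragma omp atomic            /* single memory update */
            sum += x;
            #pragma omp critical (stats)  /* one thread at a time in the block */
            {
                hist[(int)(x * 10.0)]++;
                if (x > maxx) maxx = x;   /* two related updates need critical */
            }
        }
        printf("sum = %f, max = %f\n", sum, maxx);
        return 0;
    }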

task construct
- Used to parallelize irregular, recursive algorithms
- All tasks run independently of each other, in parallel, on up to OMP_NUM_THREADS threads
- Use taskwait to wait for all tasks to finish
- Each task has its own data space; use mergeable for shared variables to reduce storage needs
- Use depend to specify data dependencies
- Often started from a serial section

task example
Calculate a Fibonacci number using recursion:

    /* recursive function; the two calls are independent tasks */
    int fib(int n) {
      int i, j;
      if (n < 2) return n;
      else {
        #pragma omp task shared(i)      /* independent task #1 */
        i = fib(n-1);
        #pragma omp task shared(j)      /* independent task #2 */
        j = fib(n-2);
        #pragma omp taskwait            /* wait for both tasks to complete */
        return i+j;
      }
    }

    /* main program - the tasks need a parallel section to run in */
    #pragma omp parallel
    {
      #pragma omp single
      { fibn = fib(n); }
    }

Event synchronization
- barrier - !$omp barrier - synchronizes all threads at that point
- ordered - !$omp ordered - imposes sequential order across iterations of a parallel loop
- master - !$omp master - block of code executed only by the master thread
- flush - !$omp flush - synchronizes memory and cache on all threads
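A brief C sketch of barrier and master (note that, unlike single, master has no implied barrier at its end):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel
        {
            printf("phase 1 on thread %d\n", omp_get_thread_num());
            #pragma omp barrier          /* no thread starts phase 2 early */
            #pragma omp master           /* master thread only */
            printf("phase 2 begins\n");
        }
        return 0;
    }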

Library functions, environment variables
- Thread set/inquiry:
  omp_set_num_threads(integer) / OMP_NUM_THREADS
  integer omp_get_num_threads()
  integer omp_get_max_threads()
  integer omp_get_thread_num()
- Set/query dynamic thread adjustment:
  omp_set_dynamic(logical) / OMP_DYNAMIC
  logical omp_get_dynamic()

Library functions, environment variables, cont.
- Lock/unlock functions (each takes an OpenMP lock variable as argument):
  omp_init_lock()
  omp_set_lock()
  omp_unset_lock()
  logical omp_test_lock()
  omp_destroy_lock()
- Other:
  integer omp_get_num_procs()
  logical omp_in_parallel()
  OMP_SCHEDULE
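A C sketch of the lock routines (counter is illustrative); each call takes the address of an omp_lock_t variable:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_lock_t lock;
        int counter = 0;
        omp_init_lock(&lock);
        #pragma omp parallel
        {
            omp_set_lock(&lock);     /* blocks until the lock is acquired */
            counter++;               /* one thread at a time */
            omp_unset_lock(&lock);
        }
        omp_destroy_lock(&lock);
        printf("counter = %d\n", counter);   /* equals the number of threads */
        return 0;
    }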

Advanced OpenMP
- Nested parallel loops
- Accelerator support (4.0)
- User-defined reductions (4.0)
- Thread affinity (4.0)

Summary
- Parallel do/for loops: variable scope, reduction
- Parallel overhead, loop scheduling
- Parallel regions
- Mutual exclusion
- Work sharing, tasking
- Synchronization
http://www.chpc.utah.edu/short_courses/intro_openmp

References
Web:
- http://www.openmp.org/
- https://computing.llnl.gov/tutorials/openmp
- http://openmp.org/mp-documents/OpenMP_Examples_4.0.1.pdf
Books:
- Chapman, Jost, van der Pas: Using OpenMP
- Pacheco: An Introduction to Parallel Programming
Survey: https://www.surveymonkey.com/r/8hlj3wk