Introduction to OpenMP

Introduction to OpenMP
Martin Čuma
Center for High Performance Computing, University of Utah
mcuma@chpc.utah.edu
September 9, 2004
http://www.chpc.utah.edu

Overview
- Quick introduction
- Parallel loops
- Parallel loop directives
- Parallel sections
- Some more advanced directives
- Summary

Shared memory
- All processors have access to the same memory
- Simpler programming
- Concurrent memory access
- More specialized hardware
- CHPC: HP Alpha 4-way (sierra), PC Linux 2-way (icebox, arches)

OpenMP basics
- Directives
- Runtime library routines
- Environment variables

Compiler directives to parallelize:
- Fortran: source code comments
    !$omp parallel
    ...
    !$omp end parallel
- C/C++: #pragmas
    #pragma omp parallel

Small set of subroutines and environment variables, e.g.:
    !$ iam = omp_get_thread_num()
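To make the directive/library split concrete, here is a minimal C sketch (an addition, not from the original slides) that forks a parallel region and queries both library routines; compile with the compiler's OpenMP flag, e.g. -fopenmp for gcc:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Fork a team of threads; the block runs once per thread */
    #pragma omp parallel
    {
        int iam = omp_get_thread_num();    /* this thread's ID */
        int nthr = omp_get_num_threads();  /* size of the team */
        printf("Hello from thread %d of %d\n", iam, nthr);
    }
    return 0;
}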

Example 1: numerical integration

Trapezoidal rule (h is the uniform spacing, x_0 = a, x_n = b):

$$\int_a^b f(x)\,dx \approx \sum_{i=0}^{n-1} \frac{h}{2}\left[f(x_i) + f(x_{i+1})\right] = h\left[\frac{f(x_0) + f(x_n)}{2} + \sum_{i=1}^{n-1} f(x_i)\right]$$

Program code

      program trapezoid
      integer n, i
      double precision a, b, h, x, integ, f

      print*,"Input integ. interval, no. of trap:"
      read(*,*) a, b, n
      h = (b-a)/n
      integ = 0.
!$omp parallel do reduction(+:integ) private(x)
      do i=1,n-1
         x = a+i*h
         integ = integ + f(x)
      enddo
      integ = integ + (f(a)+f(b))/2.
      integ = integ*h
      print*,"Total integral = ",integ
      end

Program output

sierra-int:>% f77 -omp check_omp trap.f -o trap
sierra-int:>% setenv OMP_NUM_THREADS 4
sierra-int:>% trap
Input integ. interval, no. of trap:
0 10 100
Total integral = 333.3500000000001

icebox:>% pgf77 -mp trap.f -o trap

Parallel do directive
- Fortran:
    !$omp parallel do [clause [, clause]]
    ...
    [!$omp end parallel do]
- C/C++:
    #pragma omp parallel for [clause [clause]]

Loops must have a precisely determined trip count:
- no do-while loops
- no changes to loop indices or bounds inside the loop (C)
- no jumps out of the loop (Fortran: exit, goto; C: break, goto)
- cycle (Fortran) and continue (C) are allowed
- stop (Fortran) and exit (C) are allowed
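A short C sketch (an illustration, not from the slides) of a loop that satisfies these rules:

#include <stdio.h>
#define N 1000

int main(void)
{
    double a[N] = {0.0};
    int i;
    /* Legal: the trip count (N) is known on entry, i and the bounds
       are never modified inside the loop, and nothing jumps out. */
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        if (i % 2 == 0)
            continue;          /* continue is allowed */
        a[i] = 2.0 * i;
    }
    printf("a[%d] = %f\n", N - 1, a[N - 1]);
    return 0;
}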

Clauses
Control execution of the parallel loop:
- scope: sharing of variables among the threads
- if: whether to run in parallel or in serial
- schedule: distribution of work across the threads
- ordered: perform the loop in a certain order
- copyin: initialize private variables in the loop

Data sharing
- private: each thread creates a private instance
  - not initialized upon entry to the parallel region
  - undefined upon exit from the parallel region
  - default for loop indices and variables declared inside the parallel loop
- shared: all threads share one copy
  - an update modifies the data for all other threads
  - default for everything else
- Changing the default behavior: default(shared | private | none)
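A small C sketch (an illustration, not from the slides) of the two scopes; the critical directive that protects the shared update is covered later:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int counter = 0;  /* shared: one copy seen by all threads */
    int scratch;      /* private below: one uninitialized copy per thread */
    #pragma omp parallel shared(counter) private(scratch)
    {
        scratch = omp_get_thread_num();  /* must assign before use */
        #pragma omp critical
        counter += scratch;              /* shared updates need protection */
    }
    printf("counter = %d\n", counter);
    return 0;
}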

Variable initialization and reduction
- firstprivate clause: initialization of a private variable
    !$omp parallel do firstprivate(x)
- lastprivate clause: finalization of a private variable
    !$omp parallel do lastprivate(x)
- reduction clause: performs a global operation on a variable
    !$omp parallel do reduction(+:sum)
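In C, the reduction clause looks like this (a sketch, not from the slides):

#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    int i;
    /* Each thread keeps a private partial sum, initialized to 0;
       the partial sums are combined with + at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= 100; i++)
        sum += (double)i;
    printf("sum = %f\n", sum);   /* prints 5050 */
    return 0;
}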

Removing data dependencies

Serial trapezoidal rule:

      integ = 0.
      do i=1,n-1
         x = a+i*h
         integ = integ + f(x)
      enddo

Naively parallelizing with a bare !$omp parallel do lets two threads execute x=a+i*h and integ=integ+f(x) at the same time, overwriting each other's x and losing updates to integ. Parallel solution:

      integ = 0.
!$omp parallel do private(x) reduction(+:integ)
      do i=1,n-1
         x = a+i*h
         integ = integ + f(x)
      enddo

Data dependence classification
- Anti-dependence: a race between statement S1 writing and S2 reading
  - removal: privatization, splitting into multiple do loops
- Output dependence: values from the last iteration are used outside the loop
  - removal: lastprivate clause
- Flow dependence: data at one iteration depend on data from another iteration
  - removal: reduction, rearrangement; often impossible

Example (statements S1, S2):
      x = (b(i) + c(i))/2       ! S1
      a(i) = a(i+1) + x         ! S2
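As a sketch of removing the anti-dependence in the example above (the copy array a2 is introduced here for illustration; it is not in the slides), the old values can be snapshotted so every iteration reads data that no iteration writes:

#include <string.h>
#define N 1000

void update(double a[N], const double b[N], const double c[N])
{
    double a2[N];
    int i;
    memcpy(a2, a, sizeof a2);   /* snapshot the old values of a */
    /* Iterations now read only a2, which nobody writes, so they are
       independent; x is private because it is declared in the loop body. */
    #pragma omp parallel for
    for (i = 0; i < N - 1; i++) {
        double x = (b[i] + c[i]) / 2.0;
        a[i] = a2[i + 1] + x;
    }
}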

Parallel overhead
- Parallelization costs CPU time
- Nested loops: parallelize the outermost loop
- if clause: parallelize only when it is worth it, e.g. above a certain number of iterations (see the C sketch below):
    !$omp parallel do if (n .ge. 800)
          do i = 1, n
             ...
          enddo
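The same guard in C (a sketch; the threshold of 800 iterations is taken from the Fortran example above):

/* Fork threads only when the loop is long enough for the parallel
   overhead to pay off; below the threshold it runs serially. */
void add(int n, double *a, const double *b, const double *c)
{
    int i;
    #pragma omp parallel for if (n >= 800)
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}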

Load balancing: scheduling
- user-defined work distribution: schedule(type[, chunk])
- chunk: number of iterations contiguously assigned to threads
- type:
  - static: each thread gets a constant chunk
  - dynamic: work distribution to threads varies
  - guided: chunk size decreases exponentially
  - runtime: schedule decided at run time
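A C sketch of a schedule clause for iterations of uneven cost (the function work() is a hypothetical placeholder):

double work(int i);   /* hypothetical, per-iteration cost varies */

void run(int n, double *out)
{
    int i;
    /* dynamic: hand out chunks of 16 iterations to threads as they
       become free, balancing the load when iteration costs differ. */
    #pragma omp parallel for schedule(dynamic, 16)
    for (i = 0; i < n; i++)
        out[i] = work(i);
}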

Static schedule timings

[Figure: run time (sec) vs. chunk size for 2, 4, and 8 threads on an SGI Origin 2000; the default chunk, Niter/Nproc, is marked.]

Schedule comparison

[Figure: run time (sec) vs. chunk size for dynamic, guided, and static schedules on an SGI Origin 2000 with OMP_NUM_THREADS = 8; the default is marked.]

Example 2: MPI-like parallelization

#include <stdio.h>
#include "omp.h"
#define min(a,b) ((a) < (b) ? (a) : (b))

int istart, iend;
#pragma omp threadprivate(istart, iend)

int main (int argc, char* argv[])
{
    int n, nthreads, iam, chunk;
    float a, b;
    double h, integ, p_integ;
    double f(double x);
    double get_integ(double a, double h);

    printf("Input integ. interval, no. of trap:\n");
    scanf("%f %f %d", &a, &b, &n);
    h = (b-a)/n;
    integ = 0.;

Example 2, cont.

    #pragma omp parallel shared(integ) private(p_integ,nthreads,iam,chunk)
    {
        nthreads = omp_get_num_threads();
        iam = omp_get_thread_num();
        chunk = (n + nthreads - 1)/nthreads;
        istart = iam*chunk + 1;
        iend = min((iam+1)*chunk + 1, n);
        p_integ = get_integ(a, h);
        #pragma omp critical
        integ += p_integ;
    }

    integ += (f(a) + f(b))/2.;
    integ *= h;
    printf("Total integral = %f\n", integ);
    return 0;
}

Example 2, cont.

double get_integ(double a, double h)
{
    int i;
    double sum, x;
    double f(double x);   /* integrand, defined elsewhere */

    sum = 0;
    for (i = istart; i < iend; i++) {
        x = a + i*h;
        sum += f(x);
    }
    return sum;
}

Parallel regions
- Fortran:
    !$omp parallel
    ...
    !$omp end parallel
- C/C++:
    #pragma omp parallel
- SPMD parallelism: replicated execution
- must be a self-contained block of code: 1 entry, 1 exit
- implicit barrier at the end of the parallel region
- can use the same clauses as parallel do/for

threadprivate variables
- private copies of global/common block variables exist only in the lexical scope of the parallel region
- possible solutions:
  - pass private variables as function arguments
  - use threadprivate: identifies a common block/global variable as private
      !$omp threadprivate (/cb/ [,/cb/] ...)
      #pragma omp threadprivate (list)
  - use the copyin clause to initialize the threadprivate variables, e.g.
      !$omp parallel copyin(istart,iend)
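A C sketch of threadprivate plus copyin, modeled on Example 2 above (an illustration, not from the slides):

#include <stdio.h>
#include <omp.h>

int istart, iend;                        /* file-scope globals */
#pragma omp threadprivate(istart, iend)  /* one copy per thread */

int main(void)
{
    istart = 0;
    iend = 100;
    /* copyin initializes every thread's copy from the master's values */
    #pragma omp parallel copyin(istart, iend)
    printf("thread %d: istart=%d iend=%d\n",
           omp_get_thread_num(), istart, iend);
    return 0;
}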

Mutual exclusion
- critical section: limits access to a part of the code to one thread at a time
    !$omp critical [name]
    ...
    !$omp end critical [name]
- atomic section: atomically updates a single memory location, e.g. sum += x
- runtime library lock functions (covered below)
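A C sketch contrasting critical and atomic (an illustration, not from the slides):

#include <stdio.h>

int main(void)
{
    int hist[10] = {0};
    double sum = 0.0;
    int i;
    #pragma omp parallel for
    for (i = 0; i < 1000; i++) {
        #pragma omp atomic      /* cheap: a single memory update */
        hist[i % 10] += 1;

        #pragma omp critical    /* general: serializes a whole block */
        sum += 1.0 / (i + 1);
    }
    printf("hist[0]=%d sum=%f\n", hist[0], sum);
    return 0;
}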

Library functions, environment variables
- thread set/inquiry:
    omp_set_num_threads(integer) / OMP_NUM_THREADS
    integer omp_get_num_threads()
    integer omp_get_max_threads()
    integer omp_get_thread_num()
- set/query dynamic thread adjustment:
    omp_set_dynamic(logical) / OMP_DYNAMIC
    logical omp_get_dynamic()

Library functions, environment variables (cont.)
- lock/unlock functions:
    omp_init_lock()
    omp_set_lock()
    omp_unset_lock()
    logical omp_test_lock()
    omp_destroy_lock()
- other:
    integer omp_get_num_procs()
    logical omp_in_parallel()
    OMP_SCHEDULE
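A C sketch of the lock routines (in the C API they take a pointer to an omp_lock_t):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_lock_t lock;
    int total = 0;
    omp_init_lock(&lock);        /* create the lock */
    #pragma omp parallel
    {
        omp_set_lock(&lock);     /* block until the lock is acquired */
        total += 1;              /* protected update */
        omp_unset_lock(&lock);   /* release it */
    }
    omp_destroy_lock(&lock);     /* free the lock */
    printf("total = %d\n", total);
    return 0;
}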

Work-sharing
- Distribution of work among the threads
- Work-sharing constructs:
  - do/for directive: same work distribution as parallel do, inside an existing parallel region
  - sections directive: divides code into sections which run on separate threads
  - single directive: assigns work to a single thread
- Restrictions:
  - continuous block; no nesting
  - all threads must reach the same construct
  - constructs can be outside the lexical scope of the parallel construct (e.g. in a subroutine)
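A C sketch of the sections and single constructs inside one parallel region (taskA and taskB are hypothetical placeholders):

#include <stdio.h>

void taskA(void) { printf("task A\n"); }   /* hypothetical work */
void taskB(void) { printf("task B\n"); }   /* hypothetical work */

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp sections    /* each section goes to one thread */
        {
            #pragma omp section
            taskA();
            #pragma omp section
            taskB();
        }                       /* implicit barrier here */
        #pragma omp single      /* exactly one thread executes this */
        printf("done\n");
    }
    return 0;
}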

Event synchronization
- barrier (!$omp barrier): synchronizes all threads at that point
- ordered (!$omp ordered): imposes order across iterations of a parallel loop
- master (!$omp master): block of code executed only by the master thread
- flush (!$omp flush): synchronizes memory and cache on all threads
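A C sketch combining master and barrier (an illustration; note that master, unlike single, has no implied barrier, and that barrier implies a flush):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int ready = 0;
    #pragma omp parallel shared(ready)
    {
        #pragma omp master
        ready = 1;             /* only the master thread sets the flag */

        #pragma omp barrier    /* wait (and flush) so every thread
                                  sees ready == 1 before reading it */
        printf("thread %d sees ready=%d\n",
               omp_get_thread_num(), ready);
    }
    return 0;
}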

Summary
- parallel do/for loops: variable scope, reduction
- parallel overhead, loop scheduling
- parallel regions
- mutual exclusion
- work sharing
- synchronization

http://www.chpc.utah.edu/short_courses/intro_openmp

References
- Web: http://www.openmp.org/ and http://www.openmp.org/specs/
- Book: Chandra, Dagum, Kohr et al., Parallel Programming in OpenMP