OpenMP Tutorial. Seung-Jai Min. School of Electrical and Computer Engineering Purdue University, West Lafayette, IN


OpenMP Tutorial Seung-Jai Min (smin@purdue.edu) School of Electrical and Computer Engineering Purdue University, West Lafayette, IN

Parallel Programming Standards
- Thread libraries: Win32 API / POSIX threads
- Compiler directives: OpenMP (shared-memory programming) -- OUR FOCUS
- Message-passing libraries: MPI (distributed-memory programming)

Shared Memory Parallel Programming in the Multi-Core Era
- Desktop and laptop: 2, 4, 8 cores, and more to come
- A single node in distributed-memory clusters is shared-memory hardware; a Steele cluster node: 2 x 8 (16) cores
- Accelerators: Cell processors (1 PPE and 8 SPEs), NVIDIA Quadro GPUs (128 processing units)

OpenMP: Some syntax details to get us started
Most of the constructs in OpenMP are compiler directives or pragmas.
For C and C++, the pragmas take the form:

  #pragma omp construct [clause [clause] ...]

For Fortran, the directives take one of the forms:

  C$OMP construct [clause [clause] ...]
  !$OMP construct [clause [clause] ...]
  *$OMP construct [clause [clause] ...]

Include file:

  #include <omp.h>
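To make the syntax concrete, here is a minimal compilable example (a sketch added for this write-up, not from the slides) using the pragma form and the include file; it assumes a compiler with OpenMP support, e.g. gcc -fopenmp hello.c:

  #include <stdio.h>
  #include <omp.h>                 /* declarations for the OpenMP runtime */

  int main(void) {
      #pragma omp parallel         /* a construct with no clauses */
      {
          /* each thread in the team prints its own id */
          printf("hello from thread %d of %d\n",
                 omp_get_thread_num(), omp_get_num_threads());
      }
      return 0;
  }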

How is OpenMP typically used? OpenMP is usually used to parallelize loops: find your most time-consuming loops and split them up between threads.

Sequential program:

  void main() {
    int i, k, N=1000;
    double A[N], B[N], C[N];
    for (i=0; i<N; i++) {
      A[i] = B[i] + k*C[i];
    }
  }

Parallel program:

  #include <omp.h>
  void main() {
    int i, k, N=1000;
    double A[N], B[N], C[N];
    #pragma omp parallel for
    for (i=0; i<N; i++) {
      A[i] = B[i] + k*C[i];
    }
  }

How is OpenMP typically used? Single Program Multiple Data (SPMD) (cont.)

Parallel program:

  #include <omp.h>
  void main() {
    int i, k, N=1000;
    double A[N], B[N], C[N];
    #pragma omp parallel for
    for (i=0; i<N; i++) {
      A[i] = B[i] + k*C[i];
    }
  }

How is OpenMP typically used? Single Program Multiple Data (SPMD) (cont.)

Each of the four threads runs the same program on its own quarter of the iteration space:

  void main() {                    /* executed by threads 0..3 */
    int i, k, N=1000;
    double A[N], B[N], C[N];
    lb = ...;  ub = ...;           /* thread 0: lb=0,   ub=250  */
                                   /* thread 1: lb=250, ub=500  */
                                   /* thread 2: lb=500, ub=750  */
                                   /* thread 3: lb=750, ub=1000 */
    for (i=lb; i<ub; i++) {
      A[i] = B[i] + k*C[i];
    }
  }

OpenMP Fork-and-Join model

  printf("program begin\n");      /* serial   */
  N = 1000;
  #pragma omp parallel for        /* parallel */
  for (i=0; i<N; i++)
    A[i] = B[i] + C[i];
  M = 500;                        /* serial   */
  #pragma omp parallel for        /* parallel */
  for (j=0; j<M; j++)
    p[j] = q[j] - r[j];           /* operator between q[j] and r[j] lost in transcription; '-' assumed */
  printf("program done\n");       /* serial   */

Execution alternates: serial, parallel, serial, parallel, serial.
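The fork-join pattern can be observed at run time by querying the team size inside and outside the parallel parts. A sketch (not from the slides; the single directive, not covered on these slides, merely keeps the output to one line per region):

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      printf("serial: %d thread(s)\n", omp_get_num_threads());       /* prints 1 */
      #pragma omp parallel
      {
          #pragma omp single      /* one thread reports for the whole team */
          printf("parallel: %d thread(s)\n", omp_get_num_threads());
      }
      printf("serial again: %d thread(s)\n", omp_get_num_threads()); /* 1 again */
      return 0;
  }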

OpenMP Constructs
1. Parallel regions: #pragma omp parallel
2. Worksharing: #pragma omp for, #pragma omp sections
3. Data environment: #pragma omp parallel shared/private (...)
4. Synchronization: #pragma omp barrier
5. Runtime functions/environment variables:

  int my_thread_id = omp_get_thread_num();   /* returns the calling thread's id */
  omp_set_num_threads(8);
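A short sketch (hypothetical function and arrays, not from the slides) that touches each of the five categories at once:

  #include <omp.h>

  void demo(double *A, const double *B, int n) {
      omp_set_num_threads(8);                   /* 5. runtime function */
      #pragma omp parallel                      /* 1. parallel region */
      {
          int my_thread_id = omp_get_thread_num();  /* private variable: 3. data environment */
          #pragma omp for                       /* 2. worksharing */
          for (int i = 0; i < n; i++)
              A[i] = 2.0 * B[i];
          /* omp for already ends in an implicit barrier; the explicit one
             below is shown only to illustrate 4. synchronization */
          #pragma omp barrier
          if (my_thread_id == 0)
              A[0] += 1.0;
      }
  }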

OpenMP: Structured blocks Most OpenMP constructs apply to structured blocks. Structured block: one point of entry at the top and one point of exit at the bottom. The only branches allowed are STOP statements in Fortran and exit() in C/C++. 10

OpenMP: Structured blocks

A structured block (the only branch stays inside the block):

  #pragma omp parallel
  {
    more: do_big_job(id);
    if (++count > 1) goto more;
  }
  printf("All done\n");

Not a structured block (branches jump into and out of the block):

  if (count == 1) goto more;
  #pragma omp parallel
  {
    more: do_big_job(id);
    if (++count > 1) goto done;
  }
  done: if (!really_done()) goto more;

Structured Block Boundaries
In C/C++: a block is a single statement or a group of statements between braces { }.

  #pragma omp parallel
  {
    id = omp_get_thread_num();
    A[id] = big_compute(id);
  }

  #pragma omp for
  for (i=0; i<N; i++) {
    res[i] = big_calc(i);
    A[i] = B[i] + res[i];
  }

Structured Block Boundaries
In Fortran: a block is a single statement or a group of statements between directive/end-directive pairs.

  C$OMP PARALLEL
   10   W(id) = garbage(id)
        res(id) = W(id)**2
        if (res(id) .gt. tol) goto 10   ! loop condition garbled in transcription; '.gt. tol' assumed
  C$OMP END PARALLEL

  C$OMP PARALLEL DO
        do I=1,N
          res(I) = bigcomp(I)
        end do
  C$OMP END PARALLEL DO

OpenMP Parallel Regions

  double A[1000];
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    int ID = omp_get_thread_num();
    pooh(ID, A);
  }
  printf("all done\n");

A single copy of A is shared between all threads: the four threads execute pooh(0,A), pooh(1,A), pooh(2,A), pooh(3,A) concurrently. Implicit barrier: threads wait at the end of the region for all threads to finish before proceeding, so printf("all done\n") runs once, after the region.

The OpenMP API: combined parallel work-share
OpenMP shortcut: put the parallel and the work-share on the same line.

  int i;
  double res[MAX];
  #pragma omp parallel
  {
    #pragma omp for
    for (i=0; i<MAX; i++) {
      res[i] = huge();
    }
  }

  int i;
  double res[MAX];
  #pragma omp parallel for
  for (i=0; i<MAX; i++) {
    res[i] = huge();
  }

These two fragments are the same OpenMP program.

Shared Memory Model
[Figure: several threads (thread1..thread5), each with its own private memory, all connected to a common shared memory.]
- Data can be shared or private
- Shared data is accessible by all threads
- Private data can be accessed only by the thread that owns it
- Data transfer is transparent to the programmer

Data Environment: default storage attributes
- Shared-memory programming model: variables are shared by default
- Distributed-memory programming model: all variables are private

Data Environment: default storage attributes
- Global variables are SHARED among threads
  - Fortran: COMMON blocks, SAVE variables, MODULE variables
  - C: file-scope variables, static variables
- But not everything is shared...
  - Stack variables in sub-programs called from parallel regions are PRIVATE
  - Automatic variables within a statement block are PRIVATE

Data Environment

  int A[100][100];          /* global: SHARED (slide reads int A[100]; 2-D shape assumed from the indexing) */

  int foo(int x) {          /* x and count live on the stack: PRIVATE */
    int count = 0;
    return x * count;
  }

  int main() {
    int ii, jj;             /* jj made PRIVATE by the clause below */
    int B[100][100];        /* SHARED */
    #pragma omp parallel private(jj)
    {
      int kk = 1;           /* block-scope automatic variable: PRIVATE */
      #pragma omp for
      for (ii=0; ii<N; ii++)
        for (jj=0; jj<N; jj++)
          A[ii][jj] = foo(B[ii][jj]);
    }
  }
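A runnable variant of the same example (a sketch; the function name is made up) using default(none), which forces every variable's attribute to be spelled out and lets the compiler catch omissions:

  #include <omp.h>
  #define N 100
  int A[N][N];                            /* global: SHARED */

  void fill(int B[N][N]) {
      int ii, jj;
      #pragma omp parallel default(none) shared(A, B) private(jj)
      {
          #pragma omp for                 /* ii, as the workshared loop index, */
          for (ii = 0; ii < N; ii++)      /* is predetermined private          */
              for (jj = 0; jj < N; jj++)
                  A[ii][jj] = B[ii][jj] + 1;
      }
  }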

Work-Sharing Construct: Loop Construct

  #pragma omp for [clause[[,] clause] ...] new-line
    for-loops

where clause is one of the following:
- private / firstprivate / lastprivate(list)
- reduction(operator: list)
- schedule(kind[, chunk_size])
- collapse(n)
- ordered
- nowait
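A self-contained sketch (hypothetical loop, not from the slides) combining several of these clauses on one construct:

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      double total = 0.0;
      /* collapse(2) merges the two perfectly nested loops into one
         iteration space; reduction(+:total) gives each thread a private
         partial sum; schedule(static, 100) deals out chunks of 100 */
      #pragma omp parallel for collapse(2) reduction(+:total) schedule(static, 100)
      for (int i = 0; i < 100; i++)
          for (int j = 0; j < 100; j++)
              total += i + j;
      printf("total = %.0f\n", total);    /* prints 990000 */
      return 0;
  }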

Schedule

  for (i=0; i<1100; i++)
    A[i] = ...;

  #pragma omp parallel for schedule(static, 250)

yields chunks 250 / 250 / 250 / 250 / 100 assigned round-robin to p0, p1, p2, p3, p0; plain schedule(static) instead yields one contiguous chunk per thread: 275 / 275 / 275 / 275 on p0..p3.

  #pragma omp parallel for schedule(dynamic, 200)

hands out chunks 200 / 200 / 200 / 200 / 200 / 100 on demand, e.g. to p3, p0, p2, p3, p1, p0.

  #pragma omp parallel for schedule(guided, 100)

starts with large chunks that shrink toward the chunk-size floor, e.g. 137, 120, 105, 100, 100, 100, 100, 100, 100, 100, 38, each handed to whichever thread is free (p0, p3, p0, p1, p2, p3, p0, p1, p2, p3, p0).

  #pragma omp parallel for schedule(auto)

leaves the choice to the compiler and runtime.
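In practice one often defers the choice with schedule(runtime) and experiments without recompiling; a sketch (hypothetical function name) using the standard OMP_SCHEDULE environment variable:

  #include <omp.h>

  void scale(double *A, int n) {
      /* kind and chunk size are taken at run time from OMP_SCHEDULE,
         e.g.:  OMP_SCHEDULE="dynamic,200" ./a.out  */
      #pragma omp parallel for schedule(runtime)
      for (int i = 0; i < n; i++)
          A[i] = 2.0 * A[i];
  }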

Critical Construct

  sum = 0;
  #pragma omp parallel private(lsum)
  {
    lsum = 0;
    #pragma omp for
    for (i=0; i<N; i++) {
      lsum = lsum + A[i];
    }
    #pragma omp critical
    {
      sum += lsum;
    }
  }

Threads wait their turn; only one thread at a time executes the critical section.
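When the protected operation is a single update like sum += lsum, #pragma omp atomic is a lighter-weight alternative to critical. A self-contained sketch of the same pattern (function name made up):

  #include <omp.h>

  double sum_array(const double *A, int n) {
      double sum = 0.0;
      #pragma omp parallel
      {
          double lsum = 0.0;              /* private: declared inside the region */
          #pragma omp for
          for (int i = 0; i < n; i++)
              lsum += A[i];
          #pragma omp atomic              /* one atomic add per thread, no lock */
          sum += lsum;
      }
      return sum;
  }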

Reduction Clause

  sum = 0;                  /* shared variable */
  #pragma omp parallel for reduction(+:sum)
  for (i=0; i<N; i++) {
    sum = sum + A[i];
  }
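The reduction clause expresses the same pattern as the critical-section version more compactly: each thread gets a private copy of sum initialized to the identity of the operator (0 for +), and the runtime combines the copies at the end of the loop. A complete, runnable sketch (array contents made up so the result is checkable):

  #include <stdio.h>
  #include <omp.h>
  #define N 1000

  int main(void) {
      double A[N], sum = 0.0;
      for (int i = 0; i < N; i++) A[i] = 1.0;   /* expected sum: N */
      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < N; i++)
          sum = sum + A[i];
      printf("sum = %.0f (expected %d)\n", sum, N);
      return 0;
  }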

Performance Evaluation
How do we measure performance? (or: how do we remove noise?)

  #define N 24000
  for (k=0; k<10; k++) {
    #pragma omp parallel for private(i, j)
    for (i=1; i<N-1; i++)
      for (j=1; j<N-1; j++)
        a[i][j] = (b[i][j-1] + b[i][j+1]) / 2.0;
  }
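The slide repeats the kernel ten times precisely to average away noise; omp_get_wtime() is the usual way to time each repetition. A sketch (N reduced from the slide's 24000 so the arrays fit comfortably in memory):

  #include <stdio.h>
  #include <omp.h>
  #define N 2400
  static double a[N][N], b[N][N];

  int main(void) {
      double best = 1e30;
      for (int k = 0; k < 10; k++) {            /* repeat to reduce noise */
          double t0 = omp_get_wtime();
          #pragma omp parallel for
          for (int i = 1; i < N-1; i++)
              for (int j = 1; j < N-1; j++)
                  a[i][j] = (b[i][j-1] + b[i][j+1]) / 2.0;
          double t = omp_get_wtime() - t0;
          if (t < best) best = t;               /* report the least-noisy run */
      }
      printf("best time: %f s\n", best);
      return 0;
  }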

Performance Issues
What if you see speedup saturation?

[Figure: speedup vs. number of CPUs (1, 2, 4, 6, 8); the curve flattens as CPUs are added.]

  #define N 12000
  #pragma omp parallel for private(j)
  for (i=1; i<N-1; i++)
    for (j=1; j<N-1; j++)
      a[i][j] = (b[i][j-1] + b[i][j] + b[i][j+1]
               + b[i-1][j] + b[i+1][j]) / 5.0;

Performance Issues
What if you see speedup saturation?

[Figure: speedup vs. number of CPUs (1, 2, 4, 6, 8); the curve flattens as CPUs are added.]

  #define N 12000
  #pragma omp parallel for private(j)
  for (i=1; i<N-1; i++)
    for (j=1; j<N-1; j++)
      a[i][j] = b[i][j];

(Hint: both this copy kernel and the stencil on the previous slide do very little arithmetic per memory access, so once the machine's memory bandwidth is saturated, adding CPUs cannot help.)

Loop Scheduling
Any guideline for a chunk size?

  #define N <big-number>
  chunk = ???;
  #pragma omp parallel for schedule(static, chunk)
  for (i=2; i<N-2; i++)
    a[i] = ( b[i-2] + b[i-1] + b[i]
           + b[i+1] + b[i+2] ) / 5.0;
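There is no universal rule, but a common starting point is a chunk that is a multiple of the cache line in elements, yet small enough that every thread gets several chunks. An illustrative heuristic (the constants are assumptions, not from the slides):

  #include <omp.h>

  int pick_chunk(int n) {
      int line = 64 / sizeof(double);     /* elements per 64-byte cache line */
      /* aim for roughly 4 chunks per thread so scheduling can balance load */
      int per_chunk = n / (4 * omp_get_max_threads());
      int chunk = (per_chunk / line) * line;   /* round down to a line multiple */
      return chunk > line ? chunk : line;
  }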

Performance Issues
Load imbalance: triangular access pattern

  #define N 12000
  #pragma omp parallel for private(j)
  for (i=1; i<N-1; i++)
    for (j=i; j<N-1; j++)
      a[i][j] = (b[i][j-1] + b[i][j] + b[i][j+1]
               + b[i-1][j] + b[i+1][j]) / 5.0;

With the default static schedule, early threads get the long rows (small i) and late threads the short ones, so threads finish at very different times.
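Two common fixes, sketched below with the same kernel (the arrays are assumed to exist as on the slide): interleave iterations with schedule(static, 1) so every thread gets a mix of long and short rows, or hand out small chunks on demand with schedule(dynamic, chunk):

  #define N 12000
  extern double a[N][N], b[N][N];     /* hypothetical: declared elsewhere */

  void smooth_balanced(void) {
      int i, j;
      /* chunks of one iteration, dealt round-robin: thread t gets rows
         t, t+T, t+2T, ..., mixing long (small i) and short (large i) rows */
      #pragma omp parallel for private(j) schedule(static, 1)
      for (i = 1; i < N-1; i++)
          for (j = i; j < N-1; j++)
              a[i][j] = (b[i][j-1] + b[i][j] + b[i][j+1]
                       + b[i-1][j] + b[i+1][j]) / 5.0;
  }
  /* schedule(dynamic, 16) is the on-demand alternative */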

Summary
OpenMP has advantages:
- Incremental parallelization
- Compared to MPI: no data partitioning, no communication scheduling

Resources
- http://www.openmp.org
- http://openmp.org/wp/resources