Parallel Programming in C with MPI and OpenMP


Parallel Programming in C with MPI and OpenMP
Michael J. Quinn
Chapter 17: Shared-memory Programming

Outline
- OpenMP
- Shared-memory model
- Parallel for loops
- Declaring private variables
- Critical sections
- Reductions
- Performance improvements
- More general data parallelism
- Functional parallelism

OpenMP
- OpenMP: an application programming interface (API) for parallel programming on multiprocessors
  - Compiler directives
  - Library of support functions
- OpenMP works in conjunction with Fortran, C, or C++

What's OpenMP Good For?
- C + OpenMP sufficient to program multiprocessors
- C + MPI + OpenMP a good way to program multicomputers built out of multiprocessors
  - IBM RS/6000 SP
  - Fujitsu AP3000
  - Dell High Performance Computing Cluster

Shared-memory Model
[Figure: several processors connected to a single shared memory]
- Processors interact and synchronize with each other through shared variables.

Fork/Join Parallelism
- Initially only the master thread is active
- Master thread executes sequential code
- Fork: master thread creates or awakens additional threads to execute parallel code
- Join: at the end of parallel code, the created threads die or are suspended
[Figure: over time, the master thread forks other threads, joins them, then forks and joins again]
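Illustration (not from the original slides): a minimal sketch of fork/join. The master thread runs alone, forks a team at the parallel region, and runs alone again after the implicit join. The thread count of 4 is an assumption made only for this example.

    #include <stdio.h>
    #include <omp.h>

    int main (void)
    {
        printf ("Master thread only (before fork)\n");
        #pragma omp parallel num_threads(4)   /* fork: team of 4 threads assumed for illustration */
        {
            printf ("Hello from thread %d\n", omp_get_thread_num());
        }                                     /* join: worker threads die or are suspended */
        printf ("Master thread only (after join)\n");
        return 0;
    }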

Shared-memory Model vs. Message-passing Model (#1)
- Shared-memory model
  - Number of active threads is 1 at start and finish of program, changes dynamically during execution
- Message-passing model
  - All processes active throughout execution of program

Incremental Parallelization
- Sequential program is a special case of a shared-memory parallel program
- Parallel shared-memory programs may have only a single parallel loop
- Incremental parallelization: process of converting a sequential program to a parallel program a little bit at a time

Shared-memory Model vs. Message-passing Model (#2)
- Shared-memory model
  - Execute and profile sequential program
  - Incrementally make it parallel
  - Stop when further effort not warranted
- Message-passing model
  - Sequential-to-parallel transformation requires major effort
  - Transformation done in one giant step rather than many tiny steps

Parallel for Loops
- C programs often express data-parallel operations as for loops:
    for (i = first; i < size; i += stride)
        marked[i] = 1;
- OpenMP makes it easy to indicate when the iterations of a loop may execute in parallel
- Compiler takes care of generating code that forks/joins threads and allocates the iterations to threads

Pragmas
- Pragma: a compiler directive in C or C++
- Stands for "pragmatic information"
- A way for the programmer to communicate with the compiler
- Compiler free to ignore pragmas
- Syntax:
    #pragma omp <rest of pragma>

Parallel for Pragma
- Format:
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];
- Compiler must be able to verify the run-time system will have the information it needs to schedule loop iterations
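To show the pragma in context (a plausible sketch, not taken from the slides), here is a complete vector-addition program; the array size N and the initialization are assumptions added so the fragment compiles on its own. Build with an OpenMP-enabled compiler, e.g. gcc -fopenmp.

    #include <stdio.h>

    #define N 1000                      /* problem size assumed for this sketch */

    int main (void)
    {
        static double a[N], b[N], c[N];
        int i;

        for (i = 0; i < N; i++) {       /* sequential initialization */
            b[i] = i;
            c[i] = 2.0 * i;
        }

        #pragma omp parallel for        /* iterations divided among the threads */
        for (i = 0; i < N; i++)
            a[i] = b[i] + c[i];

        printf ("a[N-1] = %f\n", a[N-1]);
        return 0;
    }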

Canonical Shape of for Loop Control Clause
- For OpenMP to parallelize a loop, its control clause must have the canonical shape:
    for (index = start; index <op> end; incr)
  where <op> is one of <, <=, >=, or >, and incr is one of:
    index++        ++index
    index--        --index
    index += inc   index -= inc
    index = index + inc
    index = inc + index
    index = index - inc

Execution Context
- Every thread has its own execution context
- Execution context: address space containing all of the variables a thread may access
- Contents of execution context:
  - static variables
  - dynamically allocated data structures in the heap
  - variables on the run-time stack
  - additional run-time stack for functions invoked by the thread
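For concreteness (these loop headers are illustrative examples, not from the slides), each of the following fits the canonical shape and can carry a parallel for pragma, whereas a while loop or a for loop whose bound changes inside the body cannot.

    /* i, n, first, last, stride are assumed to be ints in scope */
    #pragma omp parallel for
    for (i = 0; i < n; i++) { /* ... */ }

    #pragma omp parallel for
    for (i = n - 1; i >= 0; i--) { /* ... */ }

    #pragma omp parallel for
    for (i = first; i < last; i += stride) { /* ... */ }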

Shared and Private Variables
- Shared variable: has the same address in the execution context of every thread
- Private variable: has a different address in the execution context of every thread
- A thread cannot access the private variables of another thread

Shared and Private Variables (example)
    int main (int argc, char *argv[])
    {
        int b[3];
        char *cptr;
        int i;

        cptr = malloc(1);
        #pragma omp parallel for
        for (i = 0; i < 3; i++)
            b[i] = i;
        ...
    }
[Figure: b, cptr, and the heap block are shared; the master thread (thread 0) and thread 1 each hold a private copy of i on their own stacks]

Function omp_get_num_procs
- Returns the number of physical processors available for use by the parallel program
    int omp_get_num_procs (void)

Function omp_set_num_threads
- Uses the parameter value to set the number of threads to be active in parallel sections of code
- May be called at multiple points in a program
    void omp_set_num_threads (int t)

Pop Quiz: Write a C program segment that sets the number of threads equal to the number of processors that are available.
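One possible answer to the pop quiz (a sketch, not the book's official solution): feed the result of one call into the other.

    #include <omp.h>

    void use_all_processors (void)
    {
        /* set the number of threads equal to the number of available processors */
        omp_set_num_threads (omp_get_num_procs ());
    }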

Declaring Private Variables
    for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
        for (j = 0; j < n; j++)
            a[i][j] = MIN(a[i][j], a[i][k]+tmp);
- Either loop could be executed in parallel
- We prefer to make the outer loop parallel, to reduce the number of forks/joins
- We then must give each thread its own private copy of variable j

private Clause
- Clause: an optional, additional component to a pragma
- private clause: directs compiler to make one or more variables private
    private ( <variable list> )

Example Use of private Clause
    #pragma omp parallel for private(j)
    for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
        for (j = 0; j < n; j++)
            a[i][j] = MIN(a[i][j], a[i][k]+tmp);
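A self-contained sketch of the same idea (the matrix and its dimensions are assumptions for illustration): without private(j), every thread would share a single j and corrupt each other's inner loops. The outer loop index i is made private automatically because it indexes the parallel for.

    #define ROWS 100
    #define COLS 100

    void init_matrix (double a[ROWS][COLS])
    {
        int i, j;

        /* outer loop parallel; each thread gets its own copy of j */
        #pragma omp parallel for private(j)
        for (i = 0; i < ROWS; i++)
            for (j = 0; j < COLS; j++)
                a[i][j] = 0.0;
    }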

firstprivate Clause
- Used to create private variables having initial values identical to the variable controlled by the master thread as the loop is entered
- Variables are initialized once per thread, not once per loop iteration
- If a thread modifies a variable's value in an iteration, subsequent iterations will get the modified value

lastprivate Clause
- Sequentially last iteration: the iteration that occurs last when the loop is executed sequentially
- lastprivate clause: used to copy back to the master thread's copy of a variable the private copy of the variable from the thread that executed the sequentially last iteration
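An illustrative fragment showing both clauses (not from the slides; the values are arbitrary): each thread's x starts from the master's value via firstprivate, and the value written in the sequentially last iteration is copied back via lastprivate.

    #include <stdio.h>

    int main (void)
    {
        double x = 1.0;           /* master thread's value, copied into each thread's x */
        double last = 0.0;
        int i, n = 100;

        #pragma omp parallel for firstprivate(x) lastprivate(last)
        for (i = 0; i < n; i++) {
            x += 1.0;             /* firstprivate: every thread's private x started at 1.0 */
            last = (double) i;    /* lastprivate: value from iteration n-1 is copied out */
        }

        printf ("last = %f (from the sequentially last iteration)\n", last);  /* 99.0 */
        printf ("x = %f (master's copy untouched by firstprivate)\n", x);     /* 1.0  */
        return 0;
    }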

Critical Sections
    double area, pi, x;
    int i, n;
    ...
    area = 0.0;
    for (i = 0; i < n; i++) {
        x = (i+0.5)/n;
        area += 4.0/(1.0 + x*x);
    }
    pi = area / n;

Race Condition
- Consider this C program segment to compute π using the rectangle rule:
    double area, pi, x;
    int i, n;
    ...
    area = 0.0;
    for (i = 0; i < n; i++) {
        x = (i+0.5)/n;
        area += 4.0/(1.0 + x*x);
    }
    pi = area / n;

Race Condition (cont.)
- If we simply parallelize the loop...
    double area, pi, x;
    int i, n;
    ...
    area = 0.0;
    #pragma omp parallel for private(x)
    for (i = 0; i < n; i++) {
        x = (i+0.5)/n;
        area += 4.0/(1.0 + x*x);
    }
    pi = area / n;

Race Condition (cont.)
- ...we set up a race condition in which one thread may race ahead of another and not see its change to the shared variable area
[Figure: both threads read area = 11.667 before executing area += 4.0/(1.0 + x*x); Thread A writes 15.432 and Thread B writes 15.230, so one update is lost; the answer should be 18.995]

Race Condition Time Line
    Value of area    Thread A              Thread B
    11.667           reads area (11.667)
    11.667                                 reads area (11.667)
    15.432           adds 3.765, stores
    15.230                                 adds 3.563, stores

critical Pragma
- Critical section: a portion of code that only one thread at a time may execute
- We denote a critical section by putting the pragma
    #pragma omp critical
  in front of a block of C code

Correct, But Inefficient, Code
    double area, pi, x;
    int i, n;
    ...
    area = 0.0;
    #pragma omp parallel for private(x)
    for (i = 0; i < n; i++) {
        x = (i+0.5)/n;
        #pragma omp critical
        area += 4.0/(1.0 + x*x);
    }
    pi = area / n;

Source of Inefficiency
- Update to area is inside a critical section
- Only one thread at a time may execute the statement; i.e., it is sequential code
- Time to execute the statement is a significant part of the loop
- By Amdahl's Law we know speedup will be severely constrained

Reductions
- Reductions are so common that OpenMP provides support for them
- May add reduction clause to parallel for pragma
- Specify reduction operation and reduction variable
- OpenMP takes care of storing partial results in private variables and combining partial results after the loop

reduction Clause
- The reduction clause has this syntax:
    reduction (<op> : <variable>)
- Operators
  - +   Sum
  - *   Product
  - &   Bitwise and
  - |   Bitwise or
  - ^   Bitwise exclusive or
  - &&  Logical and
  - ||  Logical or

π-finding Code with Reduction Clause
    double area, pi, x;
    int i, n;
    ...
    area = 0.0;
    #pragma omp parallel for \
        private(x) reduction(+:area)
    for (i = 0; i < n; i++) {
        x = (i + 0.5)/n;
        area += 4.0/(1.0 + x*x);
    }
    pi = area / n;

Performance Improvement #1
- Too many fork/joins can lower performance
- Inverting loops may help performance if
  - Parallelism is in the inner loop
  - After inversion, the outer loop can be made parallel
  - Inversion does not significantly lower the cache hit rate
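A complete, runnable version of the π example (the value of n and the printing are assumptions added so the fragment compiles on its own; the reduction clause itself follows the slide):

    #include <stdio.h>

    int main (void)
    {
        double area = 0.0, pi, x;
        int i, n = 1000000;     /* number of rectangles, chosen arbitrarily */

        #pragma omp parallel for private(x) reduction(+:area)
        for (i = 0; i < n; i++) {
            x = (i + 0.5) / n;
            area += 4.0 / (1.0 + x * x);
        }
        pi = area / n;

        printf ("Estimate of pi: %f\n", pi);
        return 0;
    }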

Performance Improvement #2
- If a loop has too few iterations, fork/join overhead is greater than the time savings from parallel execution
- The if clause instructs the compiler to insert code that determines at run-time whether the loop should be executed in parallel; e.g.,
    #pragma omp parallel for if(n > 5000)

Performance Improvement #3
- We can use the schedule clause to specify how iterations of a loop should be allocated to threads
- Static schedule: all iterations allocated to threads before any iterations are executed
- Dynamic schedule: only some iterations allocated to threads at the beginning of the loop's execution. Remaining iterations are allocated to threads that complete their assigned iterations.

Static vs. Dynamic Scheduling
- Static scheduling
  - Low overhead
  - May exhibit high workload imbalance
- Dynamic scheduling
  - Higher overhead
  - Can reduce workload imbalance

Chunks
- A chunk is a contiguous range of iterations
- Increasing chunk size reduces overhead and may increase cache hit rate
- Decreasing chunk size allows finer balancing of workloads

schedule Clause
- Syntax of schedule clause:
    schedule (<type>[, <chunk>])
- Schedule type required, chunk size optional
- Allowable schedule types
  - static: static allocation
  - dynamic: dynamic allocation
  - guided: guided self-scheduling
  - runtime: type chosen at run-time based on value of environment variable OMP_SCHEDULE

Scheduling Options
- schedule(static): block allocation of about n/t contiguous iterations to each thread
- schedule(static,C): interleaved allocation of chunks of size C to threads
- schedule(dynamic): dynamic one-at-a-time allocation of iterations to threads
- schedule(dynamic,C): dynamic allocation of C iterations at a time to threads

Scheduling Options (cont.)
- schedule(guided,C): dynamic allocation of chunks to tasks using the guided self-scheduling heuristic. Initial chunks are bigger, later chunks are smaller, minimum chunk size is C.
- schedule(guided): guided self-scheduling with minimum chunk size 1
- schedule(runtime): schedule chosen at run-time based on value of OMP_SCHEDULE; Unix example:
    setenv OMP_SCHEDULE static,1

More General Data Parallelism
- Our focus has been on the parallelization of for loops
- Other opportunities for data parallelism
  - processing items on a "to do" list
  - for loop + additional code outside of loop
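To make the schedule clause concrete (an illustrative sketch, not from the slides): when iteration costs vary widely, a dynamic schedule with a modest chunk size often balances the load better than the default static schedule. The work() function and the chunk size of 10 are assumptions for this example.

    #include <math.h>

    /* hypothetical per-iteration work whose cost grows with i */
    static double work (int i)
    {
        double s = 0.0;
        for (int k = 0; k < i; k++)
            s += sin ((double) k);
        return s;
    }

    void run (double *result, int n)
    {
        #pragma omp parallel for schedule(dynamic, 10)   /* hand out 10 iterations at a time */
        for (int i = 0; i < n; i++)
            result[i] = work (i);
    }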

Processing a "To Do" List
[Figure: job_ptr is a shared variable pointing to the list in the heap; the master thread and thread 1 each hold a private task_ptr]

Sequential Code (1/2)
    int main (int argc, char *argv[])
    {
        struct job_struct *job_ptr;
        struct task_struct *task_ptr;
        ...
        task_ptr = get_next_task (&job_ptr);
        while (task_ptr != NULL) {
            complete_task (task_ptr);
            task_ptr = get_next_task (&job_ptr);
        }
        ...
    }

Sequential Code (2/2)
    struct task_struct *get_next_task (struct job_struct **job_ptr)
    {
        struct task_struct *answer;

        if (*job_ptr == NULL) answer = NULL;
        else {
            answer = (*job_ptr)->task;
            *job_ptr = (*job_ptr)->next;
        }
        return answer;
    }

Parallelization Strategy
- Every thread should repeatedly take the next task from the list and complete it, until there are no more tasks
- We must ensure no two threads take the same task from the list; i.e., we must declare a critical section

parallel Pragma
- The parallel pragma precedes a block of code that should be executed by all of the threads
- Note: execution is replicated among all threads

Use of parallel Pragma
    #pragma omp parallel private(task_ptr)
    {
        task_ptr = get_next_task (&job_ptr);
        while (task_ptr != NULL) {
            complete_task (task_ptr);
            task_ptr = get_next_task (&job_ptr);
        }
    }

Critical Section for get_next_task
    struct task_struct *get_next_task (struct job_struct **job_ptr)
    {
        struct task_struct *answer;

        #pragma omp critical
        {
            if (*job_ptr == NULL) answer = NULL;
            else {
                answer = (*job_ptr)->task;
                *job_ptr = (*job_ptr)->next;
            }
        }
        return answer;
    }

Functions for SPMD-style Programming
- The parallel pragma allows us to write SPMD-style programs
- In these programs we often need to know the number of threads and the thread ID number
- OpenMP provides functions to retrieve this information

Function omp_get_thread_num
- This function returns the thread identification number
- If there are t threads, the ID numbers range from 0 to t-1
- The master thread has ID number 0
    int omp_get_thread_num (void)

Function omp_get_num_threads
- Function omp_get_num_threads returns the number of active threads
- If called from a sequential portion of the program, it returns 1
    int omp_get_num_threads (void)
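A small SPMD-style sketch using both functions (the block-partitioning arithmetic is an illustrative assumption, not from the slides): each thread computes its own slice of an array from its ID and the thread count.

    #include <stdio.h>
    #include <omp.h>

    #define N 16

    int main (void)
    {
        static double a[N];

        #pragma omp parallel
        {
            int id = omp_get_thread_num ();      /* this thread's ID, 0..t-1 */
            int t  = omp_get_num_threads ();     /* number of active threads */
            int lo = id * N / t;                 /* block partition assumed for illustration */
            int hi = (id + 1) * N / t;

            for (int i = lo; i < hi; i++)
                a[i] = 2.0 * i;

            printf ("Thread %d of %d handled [%d,%d)\n", id, t, lo, hi);
        }
        return 0;
    }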

for Pragma
- The parallel pragma instructs every thread to execute all of the code inside the block
- If we encounter a for loop that we want to divide among threads, we use the for pragma:
    #pragma omp for

Example Use of for Pragma
    #pragma omp parallel private(i,j)
    for (i = 0; i < m; i++) {
        low = a[i];
        high = b[i];
        if (low > high) {
            printf ("Exiting (%d)\n", i);
            break;
        }
        #pragma omp for
        for (j = low; j < high; j++)
            c[j] = (c[j] - a[i])/b[i];
    }

single Pragma
- Suppose we only want to see the output once
- The single pragma directs the compiler that only a single thread should execute the block of code the pragma precedes
- Syntax:
    #pragma omp single

Use of single Pragma
    #pragma omp parallel private(i,j)
    for (i = 0; i < m; i++) {
        low = a[i];
        high = b[i];
        if (low > high) {
            #pragma omp single
            printf ("Exiting (%d)\n", i);
            break;
        }
        #pragma omp for
        for (j = low; j < high; j++)
            c[j] = (c[j] - a[i])/b[i];
    }

nowait Clause
- Compiler puts a barrier synchronization at the end of every parallel for statement
- In our example, this is necessary: if a thread leaves the loop and changes low or high, it may affect the behavior of another thread
- If we make these private variables, then it is okay to let threads move ahead, which could reduce execution time

Use of nowait Clause
    #pragma omp parallel private(i,j,low,high)
    for (i = 0; i < m; i++) {
        low = a[i];
        high = b[i];
        if (low > high) {
            #pragma omp single
            printf ("Exiting (%d)\n", i);
            break;
        }
        #pragma omp for nowait
        for (j = low; j < high; j++)
            c[j] = (c[j] - a[i])/b[i];
    }

Functional Parallelism
- To this point all of our focus has been on exploiting data parallelism
- OpenMP allows us to assign different threads to different portions of code (functional parallelism)

Functional Parallelism Example
    v = alpha();
    w = beta();
    x = gamma(v, w);
    y = delta();
    printf ("%6.2f\n", epsilon(x,y));
[Dependence graph: alpha and beta feed gamma; gamma and delta feed epsilon]
- May execute alpha, beta, and delta in parallel

parallel sections Pragma
- Precedes a block of k blocks of code that may be executed concurrently by k threads
- Syntax:
    #pragma omp parallel sections

section Pragma
- Precedes each block of code within the encompassing block preceded by the parallel sections pragma
- May be omitted for the first parallel section after the parallel sections pragma
- Syntax:
    #pragma omp section

Example of parallel sections
    #pragma omp parallel sections
    {
        #pragma omp section   /* Optional */
        v = alpha();
        #pragma omp section
        w = beta();
        #pragma omp section
        y = delta();
    }
    x = gamma(v, w);
    printf ("%6.2f\n", epsilon(x,y));

Another Approach
[Dependence graph revisited: alpha and beta run concurrently, then gamma and delta run concurrently, then epsilon]
- Execute alpha and beta in parallel.
- Execute gamma and delta in parallel.

sections Pragma
- Appears inside a parallel block of code
- Has the same meaning as the parallel sections pragma
- If there are multiple sections pragmas inside one parallel block, this may reduce fork/join costs

Use of sections Pragma
    #pragma omp parallel
    {
        #pragma omp sections
        {
            v = alpha();
            #pragma omp section
            w = beta();
        }
        #pragma omp sections
        {
            x = gamma(v, w);
            #pragma omp section
            y = delta();
        }
    }
    printf ("%6.2f\n", epsilon(x,y));
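A runnable sketch of this structure (the bodies of alpha through epsilon are stand-ins invented for the example; only the pragma layout follows the slide):

    #include <stdio.h>

    /* dummy workloads standing in for the real functions */
    static double alpha (void)                  { return 1.0; }
    static double beta  (void)                  { return 2.0; }
    static double gamma_ (double v, double w)   { return v + w; }   /* renamed to avoid the legacy gamma() in libm */
    static double delta (void)                  { return 4.0; }
    static double epsilon (double x, double y)  { return x * y; }

    int main (void)
    {
        double v, w, x, y;

        #pragma omp parallel
        {
            #pragma omp sections        /* alpha and beta may run concurrently */
            {
                v = alpha();
                #pragma omp section
                w = beta();
            }
            #pragma omp sections        /* gamma_ and delta may run concurrently */
            {
                x = gamma_(v, w);
                #pragma omp section
                y = delta();
            }
        }
        printf ("%6.2f\n", epsilon(x, y));
        return 0;
    }

The implicit barrier at the end of the first sections construct guarantees that v and w are ready before gamma_ runs.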

Summary (1/3)
- OpenMP: an API for shared-memory parallel programming
- Shared-memory model based on fork/join parallelism
- Data parallelism
  - parallel for pragma
  - reduction clause

Summary (2/3)
- Functional parallelism (parallel sections pragma)
- SPMD-style programming (parallel pragma)
- Critical sections (critical pragma)
- Enhancing performance of parallel for loops
  - Inverting loops
  - Conditionally parallelizing loops
  - Changing loop scheduling

Summary (3/3)

    Characteristic                          OpenMP   MPI
    Suitable for multiprocessors            Yes      Yes
    Suitable for multicomputers             No       Yes
    Supports incremental parallelization    Yes      No
    Minimal extra code                      Yes      No
    Explicit control of memory hierarchy    No       Yes