Parallel and Distributed Computing


Concurrent Programming with OpenMP Rodrigo Miragaia Rodrigues MSc in Information Systems and Computer Engineering DEA in Computational Engineering CS Department (DEI) Instituto Superior Técnico October 1 and 3, 2007

Parallel Programming. How do we write a program with concurrent execution flows? Let's revisit what you learned in your OS class two years ago.

Threads [diagram]: Process A and Process B, each containing several threads that share the process's code and global variables while keeping independent execution flows.

POSIX Threads (pthreads)

int pthread_create(pthread_t *thread, pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

Example:

pthread_t pt_worker;

void *thread_function(void *args) {
    /* thread code */
}

pthread_create(&pt_worker, NULL, thread_function, (void *) thread_args);

pthreads: Termination and Synchronization

void pthread_exit(void *value_ptr);
int pthread_join(pthread_t thread, void **value_ptr);

pthread Example: Summing the Values in Matrix Rows

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

#define N 4       /* number of rows/threads (value assumed; not shown on the slide) */
#define SIZE 8    /* row length (value assumed; not shown on the slide) */

int buffer[N][SIZE];

void *sum_row(void *ptr) {
    int index = 0, sum = 0;
    int *b = (int *) ptr;
    while (index < SIZE - 1) {
        sum += b[index++];   /* sum row */
        b[index] = sum;      /* store sum in last col. */
    }
    pthread_exit(NULL);
}

int main(void) {
    int i, j;
    pthread_t tid[N];

    for (i = 0; i < N; i++)
        for (j = 0; j < SIZE-1; j++)
            buffer[i][j] = rand() % 10;

    for (i = 0; i < N; i++) {
        if (pthread_create(&tid[i], 0, sum_row, (void *) &(buffer[i])) != 0) {
            printf("Error creating thread\n");
            exit(-1);
        } else {
            printf("Created thread w/ id %lu\n", (unsigned long) tid[i]);
        }
    }

    for (i = 0; i < N; i++)
        pthread_join(tid[i], NULL);
    printf("All threads have concluded\n");

    for (i = 0; i < N; i++) {
        for (j = 0; j < SIZE; j++)
            printf(" %d ", buffer[i][j]);
        printf("Row %d\n", i);
    }
    exit(0);
}

Thread Synchronization

int pthread_mutex_init(pthread_mutex_t *mutex, pthread_mutexattr_t *attr);
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);

Example:

pthread_mutex_t count_lock;
pthread_mutex_init(&count_lock, NULL);
...
pthread_mutex_lock(&count_lock);
count++;
pthread_mutex_unlock(&count_lock);

Synchronization Example

int count;

void *sum_row(void *ptr) {
    int index = 0, sum = 0;
    int *b = (int *) ptr;
    while (index < SIZE - 1) {
        sum += b[index++];   /* sum row */
        b[index] = sum;      /* store sum in last col. */
    }
    count++;
    printf("%dth thread has finished\n", count);
    pthread_exit(NULL);
}

Problem?

Synchronization Example

int count;
pthread_mutex_t count_lock;

void *sum_row(void *ptr) {
    int index = 0, sum = 0;
    int *b = (int *) ptr;
    while (index < SIZE - 1) {
        sum += b[index++];   /* sum row */
        b[index] = sum;      /* store sum in last col. */
    }
    pthread_mutex_lock(&count_lock);
    count++;
    printf("%dth thread has finished\n", count);
    pthread_mutex_unlock(&count_lock);
    pthread_exit(NULL);
}

main() {
    /* ... */
    pthread_mutex_init(&count_lock, NULL);
    /* ... */
}

What is OpenMP? An open specification for multi-threaded, shared-memory parallelism. It is a standard API for multi-threaded shared-memory programs, consisting of preprocessor (compiler) directives, library calls, and environment variables. More info at www.openmp.org
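To make the three components concrete, here is a minimal sketch (not from the original slides) that uses a directive, two library calls, and the OMP_NUM_THREADS environment variable:

#include <stdio.h>
#include <omp.h>                        /* library routines */

int main(void) {
    /* environment variable: e.g. "export OMP_NUM_THREADS=4" before running */
    #pragma omp parallel                /* compiler directive */
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(),    /* library call: this thread's id */
               omp_get_num_threads());  /* library call: team size */
    }
    return 0;
}

Compiled with an OpenMP-aware compiler (e.g. gcc -fopenmp), it prints one line per thread.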

OpenMP vs. Threads. (Supposedly) better than threads: a simpler programming model; the programmer separates the program into serial and parallel regions, rather than managing T concurrently-executing threads. Similar to threads: the programmer must still detect dependencies and prevent data races.

Parallel Programming Recipes.
Threads: 1) start with a parallel algorithm; 2) implement, keeping in mind data races, synchronization, and threading syntax; 3) test; 4) debug; 5) go to step 2.
OpenMP: 1) start with some algorithm; 2) implement serially, ignoring data races, synchronization, and threading syntax; 3) test and debug; 4) automagically parallelize with relatively few annotations that specify parallelism and synchronization.

OpenMP Development Process [diagram]: annotated source (/* normal C code */ ... #pragma omp ... /* more C code */) -> OpenMP compiler -> parallel program.

OpenMP Directives. Parallelization directives: parallel region, parallel for, parallel sections. Data environment directives: shared, private, threadprivate, reduction, etc. Synchronization directives: barrier, critical.

C / C++ Directives Format: #pragma omp directive-name [clause, ...] \n. Directives are case sensitive. Long directive lines may be continued on succeeding lines by escaping the newline character with a \ at the end of the directive line.
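For instance, a directive with several clauses can be split over two lines (an illustrative sketch, not from the slides; the array and loop bounds are placeholders):

#pragma omp parallel for shared(a, n) \
        private(i)
for (i = 0; i < n; i++)
    a[i] = a[i] * 2.0;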

General Rules about Directives. They always apply to the next statement, which must be a structured block. Examples:

#pragma omp ...
statement

#pragma omp ...
{
    statement1;
    statement2;
    statement3;
}

Parallel Region: #pragma omp parallel [clauses]. Creates N parallel threads; all execute the subsequent block; all wait for each other at the end of executing the block (barrier synchronization).

Barrier [diagram]

How Many Threads? Determined by the following factors, in order of precedence: 1) use of the omp_set_num_threads() library function; 2) the setting of the OMP_NUM_THREADS environment variable; 3) the implementation default, usually the number of CPUs.
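A small sketch (not from the slides) showing the precedence in practice:

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Highest precedence: explicit library call. Comment it out to fall back
       on the OMP_NUM_THREADS environment variable (e.g. "export OMP_NUM_THREADS=4");
       unset that as well and the implementation default (usually the number of
       CPUs/cores) applies. */
    omp_set_num_threads(4);

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            printf("Team size: %d\n", omp_get_num_threads());
    }
    return 0;
}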

Parallel Region Example [fork-join diagram]:

main() {
    printf("Serial Region 1");
    omp_set_num_threads(4);
    #pragma omp parallel      /* fork */
    {
        printf("Parallel Region");
    }                         /* join */
    printf("Serial Region 2");
}

Output?

Thread Identification [diagram]: the master thread (id 0) forks a team of threads (ids 0 through 7); after the join, only the master thread (id 0) continues.

Thread Count and Id API

#include <omp.h>
int omp_get_thread_num();
int omp_get_num_threads();
void omp_set_num_threads(int num);

The number of threads can also be set using the OMP_NUM_THREADS environment variable.

Example Usage

#pragma omp parallel
{
    if (!omp_get_thread_num())
        master();
    else
        slave();
}

Work Sharing Directives. They always occur within a parallel region and divide the execution of the enclosed code region among the members of the team. They do not create new threads. The two main directives are parallel for and parallel sections.

Parallel for [fork-join diagram]:

#pragma omp parallel
#pragma omp for [clauses]
for ( ... ) {
    ...
}

Each thread executes a subset of the iterations. All threads synchronize at the end of the parallel for.

Parallel for Restrictions: no data dependencies between iterations; program correctness must not depend upon which thread executes a particular iteration.
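To illustrate (a sketch, not from the slides), the first loop below is safe for parallel for, while the second carries a dependency between iterations and is not:

#include <omp.h>
#define N 1000

int a[N], b[N], c[N];

void independent_vs_dependent(void) {
    int i;

    /* Safe: each iteration touches only its own elements */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    /* NOT safe for parallel for: iteration i reads a[i-1], which may not
       yet have been written by the thread handling iteration i-1 */
    for (i = 1; i < N; i++)
        a[i] = a[i-1] + b[i];
}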

Handy Shortcut

#pragma omp parallel
#pragma omp for
for ( ; ; ) { ... }

is equivalent to

#pragma omp parallel for
for ( ; ; ) { ... }

Thread Example Revisited. With OpenMP, the explicit thread creation and join loops of the pthread version disappear: sum_row becomes an ordinary function (no pthread_exit), and the N calls are distributed over the team by a single work-sharing loop:

#pragma omp parallel for
for (i = 0; i < N; i++)
    sum_row(buffer[i]);

The rest of the program, filling buffer with random values and printing the resulting rows, is unchanged.

Multiple Work Sharing Directives. Several may occur within the same parallel region:

#pragma omp parallel
{
    #pragma omp for
    for ( ; ; ) { ... }

    #pragma omp for
    for ( ; ; ) { ... }
}

There is an implicit barrier at the end of each for.

Parallel sections. Several blocks are executed in parallel [fork-join diagram: a=,b= | c=,d= | e=,f= | g=,h=]:

#pragma omp parallel
{
    #pragma omp sections
    {
        {                       /* first section */
            a = ...; b = ...;
        }
        #pragma omp section     /* section delimiter */
        {
            c = ...; d = ...;
        }
        #pragma omp section
        {
            e = ...; f = ...;
        }
        #pragma omp section
        {
            g = ...; h = ...;
        }
    } /* omp end sections */
} /* omp end parallel */

OpenMP Memory Model. Concurrent programs access two types of data: shared data, visible to all threads; and private data, visible to a single thread (often stack-allocated). With threads: global variables are shared, local variables are private. With OpenMP: shared variables are shared, private variables are private.

OpenMP Memory Model. All variables are shared by default. Some exceptions: the loop variable of a parallel for is private; stack (local) variables in called subroutines are private. By using data directives, some variables can be made private or given other special characteristics.

Private Variables: #pragma omp parallel for private( list ). Makes a private copy of each variable in the list for each thread: no storage association with the original object; all references are to the local object; values are undefined on entry and exit. The clause also applies to other region and work-sharing directives.
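As a small sketch (not from the slides), a per-iteration scratch variable must be made private, or threads would race on a single shared copy:

#include <omp.h>
#define N 1000

double x[N], y[N];

void scale_and_shift(double a, double b) {
    int i;
    double tmp;                          /* scratch variable */

    /* Without private(tmp), all threads would share one tmp and overwrite
       each other; i, as the loop variable of a parallel for, is private
       automatically */
    #pragma omp parallel for private(tmp)
    for (i = 0; i < N; i++) {
        tmp = a * x[i];
        y[i] = tmp + b;
    }
}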

Shared Variables: #pragma omp parallel for shared( list ). Similarly, there is a shared data directive. Shared variables exist in a single location, and all threads can read and write them. It is the programmer's responsibility to ensure that multiple threads properly access shared variables (synchronization is discussed next).

Example: pthreads vs. OpenMP

pthreads:
// shared, globals
int n, *x, *y;
void loop() {
    // private, stack
    int i;
    for (i = 0; i < n; i++)
        x[i] += y[i];
}

OpenMP:
#pragma omp parallel \
        shared(n,x,y) private(i)
{
    #pragma omp for
    for (i = 0; i < n; i++)
        x[i] += y[i];
}

The clause list could have been replaced with: default(shared) private(i)

About Private Variables As mentioned, values of private variables are undefined on entry and exit A private variable within a region has no storage association with the same variable outside of the region How to override this behavior?

firstprivate / lastprivate Clauses firstprivate (list) Variables in list are initialized with the value the original variable had before entering the parallel construct lastprivate (list) The thread that executes the sequentially last iteration or section updates the value of the variables in list

Example

main() {
    a = 1;
    #pragma omp parallel
    {
        #pragma omp for private(i), firstprivate(a), lastprivate(b)
        for (i = 0; i < n; i++) {
            ...
            b = a + i;   /* a undefined, unless declared firstprivate */
            ...
        }
        a = b;           /* b undefined, unless declared lastprivate */
    } /* End of OpenMP parallel region */
}

Threadprivate Variables Private variables are private on a parallel region basis. Threadprivate variables are global variables that are private throughout the execution of the program. #pragma omp threadprivate(x) Initial data is undefined, unless copyin is used

copyin Clause: copyin (list). The data of the master thread is copied to the threadprivate copies.
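A small sketch (not from the slides) of threadprivate with copyin: each thread gets its own persistent copy of counter, initialized from the master's value at region entry:

#include <stdio.h>
#include <omp.h>

int counter = 0;
#pragma omp threadprivate(counter)      /* one persistent copy per thread */

int main(void) {
    counter = 42;                       /* set the master thread's copy */

    /* copyin initializes every thread's counter from the master's copy */
    #pragma omp parallel copyin(counter)
    {
        counter += omp_get_thread_num();
        printf("Thread %d: counter = %d\n", omp_get_thread_num(), counter);
    }
    return 0;
}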

Example. What is the output of the following code?

#include <omp.h>
int a, b, i, tid;
float x;
#pragma omp threadprivate(a, x)

main() {
    printf("1st Parallel Region:\n");
    #pragma omp parallel private(b, tid)
    {
        tid = omp_get_thread_num();
        a = tid;
        b = tid;
        x = 1.1 * tid + 1.0;
        printf("Thread %d: a,b,x = %d %d %f\n", tid, a, b, x);
    } /* end of parallel section */

    printf("2nd Parallel Region:\n");
    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf("Thread %d: a,b,x = %d %d %f\n", tid, a, b, x);
    } /* end of parallel section */
}

Thread Synchronization. So far, there are implicit barriers at the end of parallel and all other control constructs. The barrier of a work-sharing construct can be removed with the nowait clause:

#pragma omp for nowait
for ( ... ) { ... }
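As a sketch (not from the slides), nowait lets threads move on to a second loop without waiting, which is safe here because the two loops touch different arrays:

#include <omp.h>
#define N 1000

double a[N], b[N];

void two_independent_loops(void) {
    int i;

    #pragma omp parallel
    {
        #pragma omp for nowait          /* drop the barrier after this loop */
        for (i = 0; i < N; i++)
            a[i] = 2.0 * a[i];

        #pragma omp for                 /* implicit barrier kept here */
        for (i = 0; i < N; i++)
            b[i] = b[i] + 1.0;
    }
}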

Explicit Synchronization: Barrier. A barrier can be explicitly inserted via the barrier directive:

/* some multi-threaded code */
#pragma omp barrier
/* remainder of multi-threaded code */

Explicit Synchronization: Critical Section. Implements critical sections, similar to mutexes in threads:

#pragma omp critical [(name)]
{ ... }

A thread waits at the beginning of a critical region until no other thread is executing a critical region with the same name. All unnamed critical directives map to the same unspecified name.

Critical Sections Useful to avoid data races e.g., multiple threads updating the same variable May introduce a performance bottleneck

Critical Sections Example

int cnt = 0;
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 20; i++) {
        if (b[i] == 0) {
            #pragma omp critical
            {
                cnt++;
            }
        } /* endif */
        a[i] = b[i] * (i + 1);
    } /* end for */
} /* omp end parallel */

The critical directive can be replaced with atomic to define a mini-critical section (with a single statement that updates a memory location).

Single Processor Region. Ideally suited for I/O or initialization. Example:

for (i = 0; i < n; i++) {
    ...
    #pragma omp single
    {
        read_vector_from_file();
    }
    ...
}

Replace single with master to ensure that the master thread is chosen.
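A small sketch (not from the slides; the stub body of read_vector_from_file is an assumption) contrasting single and master inside a parallel region:

#include <stdio.h>
#include <omp.h>

void read_vector_from_file(void) { printf("reading vector...\n"); }

int main(void) {
    #pragma omp parallel
    {
        /* single: exactly one thread (whichever arrives first) executes
           the block; an implicit barrier follows it */
        #pragma omp single
        read_vector_from_file();

        /* master: only thread 0 executes the block; no implicit barrier */
        #pragma omp master
        printf("I am the master thread (%d)\n", omp_get_thread_num());
    }
    return 0;
}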

Some Advanced Features Conditional Parallelism Reduction clause Scheduling options

Conditional Parallelism. Oftentimes, parallelism is only useful if the problem size is sufficiently big. For smaller sizes, the overhead of parallelization exceeds the benefit.

Conditional Parallelism

#pragma omp parallel if( expression )
#pragma omp parallel for if( expression )

Execute in parallel if the expression evaluates to true, otherwise execute sequentially. Example:

for (i = 0; i < n; i++) {
    #pragma omp parallel for if(n-i > 100)
    for (j = i+1; j < n; j++)
        for (k = i+1; k < n; k++)
            a[j][k] = a[j][k] - a[i][k]*a[i][j] / a[j][j];
}

Reduction Clause: #pragma omp parallel for reduction(op : list). op is one of +, *, -, &, ^, |, &&, or ||; list is a list of shared variables. A private copy of each list variable is created for each thread. At the end of the reduction, the reduction operator is applied to all private copies of the variable, and the result is written to the global shared variable.

Reduction Example

#include <omp.h>
main() {
    int i, n = 100;
    float a[100], b[100], result = 0.0;
    for (i = 0; i < n; i++) {
        a[i] = i * 1.0;
        b[i] = i * 2.0;
    }
    #pragma omp parallel for \
        default(shared) private(i) \
        reduction(+:result)
    for (i = 0; i < n; i++)
        result = result + (a[i] * b[i]);
    printf("Final result = %f\n", result);
}

Load Balancing. With irregular workloads, care must be taken in distributing the work over the threads. Example: multiplication of two matrices, C = A x B, where the A matrix is upper-triangular (all elements below the diagonal are 0). [diagram: upper-triangular A]

Matrix Multiply Code

#pragma omp parallel for
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        c[i][j] = 0.0;
        for (k = i; k < n; k++)
            c[i][j] += a[i][k]*b[k][j];
    }

The schedule Clause: schedule(static | dynamic | guided [, chunk]) or schedule(runtime).
static [, chunk]: distribute iterations in blocks of size "chunk" over the threads in a round-robin fashion. In the absence of "chunk", each thread executes one block of approximately N/P iterations for a loop of length N and P threads. Example, loop of length 8, 2 threads:

TID   No chunk   Chunk = 2
0     1-4        1-2, 5-6
1     5-8        3-4, 7-8

The schedule Clause (cont.).
dynamic [, chunk]: fixed portions of work, whose size is controlled by the value of chunk; when a thread finishes one portion, it starts on the next.
guided [, chunk]: same dynamic behaviour as "dynamic", but the size of the portions of work decreases exponentially.
runtime: the iteration scheduling scheme is set at runtime through the environment variable OMP_SCHEDULE.
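Applied to the upper-triangular matrix multiply above, a sketch (not from the slides) of how a schedule clause can rebalance the shrinking rows:

/* Row i performs n-i inner iterations, so a plain static split overloads the
   threads that receive the early rows. A dynamic schedule hands out rows in
   chunks of 4 as threads become free. */
#pragma omp parallel for schedule(dynamic, 4)
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        c[i][j] = 0.0;
        for (k = i; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];
    }

Alternatively, schedule(runtime) together with, e.g., OMP_SCHEDULE="guided,8" allows experimenting with schedules without recompiling.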

Exercise. Parallelize a loop with data dependencies:

double *V;    /* array of 'size' elements */

for (iter = 0; iter < numiter; iter++) {
    for (i = 0; i < size-1; i++) {
        V[i] = f( V[i], V[i+1] );
    }
}

Exercise (cont.) Incorrect solution. Why?

for (iter = 0; iter < numiter; iter++) {
    /* 3.1. PROCESS ELEMENTS */
    #pragma omp parallel for default(none) \
        shared(V,totalsize) private(i) schedule(static)
    for (i = 0; i < totalsize-1; i++) {
        V[i] = f( V[i], V[i+1] );
    }
} /* 3.2. END ITERATIONS LOOP */

Exercise (cont.) Correct solution 1. How to avoid the (possibly expensive) array copy?

/* 3. ITERATIONS LOOP */
for (iter = 0; iter < numiter; iter++) {
    /* 3.1. DUPLICATE THE FULL ARRAY IN PARALLEL */
    #pragma omp parallel for default(none) shared(V,oldV,totalsize) \
        private(i) schedule(static)
    for (i = 0; i < totalsize; i++) {
        oldV[i] = V[i];
    }
    /* 3.2. INNER LOOP: PROCESS ELEMENTS IN PARALLEL */
    #pragma omp parallel for default(none) shared(V,oldV,totalsize) \
        private(i) schedule(static)
    for (i = 0; i < totalsize-1; i++) {
        V[i] = f( V[i], oldV[i+1] );
    }
} /* 3.3. END ITERATIONS LOOP */

Exercise (cont.) Correct solution 2

/* 3. ITERATIONS LOOP */
for (iter = 0; iter < numiter; iter++) {
    /* 3.1. PROCESS IN PARALLEL */
    #pragma omp parallel default(none) shared(V,size,nthreads,numiter) \
        private(iter,thread,limitL,limitR,border,i)
    {
        /* 3.1.1. GET NUMBER OF THREAD */
        thread = omp_get_thread_num();
        /* 3.1.2. COMPUTE LIMIT INDICES */
        limitL = thread*size;
        limitR = (thread+1)*size - 1;
        /* 3.1.3. COPY THE NEIGHBORING THREAD'S BORDER ELEMENT */
        if (thread != nthreads-1)
            border = V[limitR+1];
        /* 3.1.4. SYNCHRONIZE BEFORE UPDATING LOCAL PART */
        #pragma omp barrier
        /* 3.1.5. COMPUTE LOCAL UPDATES */
        for (i = limitL; i < limitR; i++) {
            V[i] = f( V[i], V[i+1] );
        }
        /* 3.1.6. COMPUTE LAST ELEMENT (EXCEPT FOR THE LAST THREAD) */
        if (thread != nthreads-1)
            V[limitR] = f( V[limitR], border );
    } /* 3.1.7. END PARALLEL REGION */
} /* 3.2. END ITERATIONS LOOP */