Concurrent Programming with OpenMP Rodrigo Miragaia Rodrigues MSc in Information Systems and Computer Engineering DEA in Computational Engineering CS Department (DEI) Instituto Superior Técnico October 1 and 3, 2007
Parallel Programming How to write a program with concurrent execution flows? Let's revisit what you learned in your OS class [ ] 2 years ago
Threads Process A Shared Code and Global Variables Process B Independent Execution Flows
POSIX Threads (pthreads)

int pthread_create(pthread_t *thread, pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

Example:

pthread_t pt_worker;

void *thread_function(void *args) {
    /* thread code */
}

pthread_create(&pt_worker, NULL, thread_function, (void *)thread_args);
pthreads: Termination and Synchronization

void pthread_exit(void *value_ptr);
int pthread_join(pthread_t thread, void **value_ptr);
pthread Example: Summing the Values in Matrix Rows

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

#define N 4
#define SIZE 5

int buffer[N][SIZE];

void *sum_row(void *ptr) {
    int index = 0, sum = 0;
    int *b = (int *) ptr;
    while (index < SIZE - 1)
        sum += b[index++];          /* sum row */
    b[index] = sum;                 /* store sum in last col. */
    pthread_exit(NULL);
}

int main(void) {
    int i, j;
    pthread_t tid[N];

    for (i = 0; i < N; i++)
        for (j = 0; j < SIZE - 1; j++)
            buffer[i][j] = rand() % 10;

    for (i = 0; i < N; i++) {
        if (pthread_create(&tid[i], 0, sum_row, (void *) &(buffer[i])) != 0) {
            printf("Error creating thread\n");
            exit(-1);
        } else {
            printf("Created thread w/ id %lu\n", (unsigned long) tid[i]);
        }
    }

    for (i = 0; i < N; i++)
        pthread_join(tid[i], NULL);
    printf("All threads have concluded\n");

    for (i = 0; i < N; i++) {
        for (j = 0; j < SIZE; j++)
            printf(" %d ", buffer[i][j]);
        printf("Row %d\n", i);
    }
    exit(0);
}
Thread Synchronization int pthread_mutex_init(pthread_mutex_t *mutex, pthread_mutexattr_t *attr); int pthread_mutex_lock(pthread_mutex_t *mutex); int pthread_mutex_unlock(pthread_mutex_t *mutex); Example: pthread_mutex_t count_lock; pthread_mutex_init(&count_lock, NULL); pthread_mutex_lock(&count_lock); count++; pthread_mutex_unlock(&count_lock);
Synchronization Example

int count;

void *sum_row(void *ptr) {
    int index = 0, sum = 0;
    int *b = (int *) ptr;
    while (index < SIZE - 1)
        sum += b[index++];          /* sum row */
    b[index] = sum;                 /* store sum in last col. */
    count++;
    printf("%dth thread has finished\n", count);
    pthread_exit(NULL);
}

Problem?
Synchronization Example

int count;
pthread_mutex_t count_lock;

void *sum_row(void *ptr) {
    int index = 0, sum = 0;
    int *b = (int *) ptr;
    while (index < SIZE - 1)
        sum += b[index++];          /* sum row */
    b[index] = sum;                 /* store sum in last col. */
    pthread_mutex_lock(&count_lock);
    count++;
    printf("%dth thread has finished\n", count);
    pthread_mutex_unlock(&count_lock);
    pthread_exit(NULL);
}

main() {
    /*...*/
    pthread_mutex_init(&count_lock, NULL);
    /*...*/
}
What is OpenMP? Open specification for Multi-Threaded, Shared Memory Parallelism Standard API for multi-threaded shared-memory programs Preprocessor (compiler) directives Library Calls Environment Variables More info at www.openmp.org
OpenMP Vs. Threads (Supposedly) Better than threads: Simpler programming model Separate a program into serial and parallel regions, rather than T concurrently-executing threads Similar to threads: Programmer must detect dependencies Programmer must prevent data races
Parallel Programming Recipes Threads: Start with a parallel algorithm Implement, keeping in mind: Data races Synchronization Threading Syntax Test Debug Goto step 2 OpenMP: Start with some algorithm Implement serially, ignoring: Data Races Synchronization Threading Syntax Test and Debug Automagically parallelize with relatively few annotations that specify parallelism and synchronization
OpenMP Development Process /* normal C code */ #pragma omp... /* more C code */ Annotated Source OpenMP Compiler Parallel Program
OpenMP Directives Parallelization directives: parallel region parallel for parallel sections Data environment directives: shared, private, threadprivate, reduction, etc. Synchronization directives: barrier, critical
C / C++ Directives Format

#pragma omp directive-name [clause,...] \n

Case sensitive
Long directive lines may be continued on succeeding lines by escaping the newline character with a \ at the end of the directive line
General Rules about Directives

They always apply to the next statement, which must be a structured block.

Examples:

#pragma omp ...
    statement

#pragma omp ...
{
    statement1;
    statement2;
    statement3;
}
Parallel Region #pragma omp parallel [clauses] Creates N parallel threads All execute subsequent block All wait for each other at the end of executing the block Barrier synchronization
Barrier
How Many Threads? Determined by following factors, in order of precedence: Use of omp_set_num_threads() library function Setting of the OMP_NUM_THREADS environment variable Implementation default usually the number of CPUs
Parallel Region Example

main() {
    printf("Serial Region 1\n");
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        printf("Parallel Region\n");
    }
    printf("Serial Region 2\n");
}

Output?
Thread Identification

[Diagram: the master thread (id 0) forks a team of threads with ids 0-7; after the join, only the master thread (id 0) continues]
Thread Count and Id API #include <omp.h> int omp_get_thread_num(); int omp_get_num_threads(); void omp_set_num_threads(int num); Can also be set using OMP_NUM_THREADS environment variable
Example Usage

#pragma omp parallel
{
    if (!omp_get_thread_num())
        master();
    else
        slave();
}
Work Sharing Directives

Always occur within a parallel region
Divide the execution of the enclosed code region among the members of the team
Do not create new threads
The two main directives are
    for
    sections
Parallel for

#pragma omp parallel
#pragma omp for [clauses]
for ( ... ) { ... }

Each thread executes a subset of the iterations
All threads synchronize at the end of parallel for
Parallel for Restrictions No data dependencies between iterations Program correctness must not depend upon which thread executes a particular iteration
Handy Shortcut

#pragma omp parallel
#pragma omp for
for ( ; ; ) { ... }

is equivalent to

#pragma omp parallel for
for ( ; ; ) { ... }
Thread Example Revisited

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

#define N 4
#define SIZE 5

int buffer[N][SIZE];

void sum_row(int *b) {
    int index = 0, sum = 0;
    while (index < SIZE - 1)
        sum += b[index++];          /* sum row */
    b[index] = sum;                 /* store sum in last col */
}

int main(void) {
    int i, j;

    for (i = 0; i < N; i++)
        for (j = 0; j < SIZE - 1; j++)
            buffer[i][j] = rand() % 10;

    /* All the pthread_create / pthread_join boilerplate becomes: */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        sum_row(buffer[i]);

    for (i = 0; i < N; i++) {
        for (j = 0; j < SIZE; j++)
            printf(" %d ", buffer[i][j]);
        printf("Row %d\n", i);
    }
    exit(0);
}
Multiple Work Sharing Directives

May occur within the same parallel region

#pragma omp parallel
{
    #pragma omp for
    for ( ; ; ) { ... }

    #pragma omp for
    for ( ; ; ) { ... }
}

Implicit barrier at the end of each for
Parallel sections

Several blocks are executed in parallel

#pragma omp parallel
{
    #pragma omp sections
    {
        { a = ...; b = ...; }
        #pragma omp section
        { c = ...; d = ...; }
        #pragma omp section
        { e = ...; f = ...; }
        #pragma omp section
        { g = ...; h = ...; }
    } /* omp end sections */
} /* omp end parallel */

[Diagram: fork; the four section blocks run in parallel; join]
OpenMP Memory Model Concurrent programs access two types of data Shared data, visible to all threads Private data, visible to a single thread (often stack-allocated) Threads: Global variables are shared Local variables are private OpenMP: shared variables are shared private variables are private
OpenMP Memory Model All variables are by default shared. Some exceptions: the loop variable of a parallel for is private stack (local) variables in called subroutines are private By using data directives, some variables can be made private or given other special characteristics.
Private Variables #pragma omp parallel for private( list ) Makes a private copy for each thread for each variable in the list No storage association with original object All references are to the local object Values are undefined on entry and exit Also applies to other region and work-sharing directives
Shared Variables

#pragma omp parallel for shared ( list )

Similarly, there is a shared data directive
Shared variables exist in a single location and all threads can read and write them
It is the programmer's responsibility to ensure that multiple threads properly access shared variables (will discuss synchronization next)
Example

pthreads:

// shared, globals
int n, *x, *y;

void loop() {
    // private, stack
    int i;
    for (i=0; i<n; i++)
        x[i] += y[i];
}

OpenMP:

#pragma omp parallel shared(n,x,y) private(i)
{
    #pragma omp for
    for (i=0; i<n; i++)
        x[i] += y[i];
}

Could have been replaced with: default(shared) private(i)
About Private Variables As mentioned, values of private variables are undefined on entry and exit A private variable within a region has no storage association with the same variable outside of the region How to override this behavior?
firstprivate / lastprivate Clauses firstprivate (list) Variables in list are initialized with the value the original variable had before entering the parallel construct lastprivate (list) The thread that executes the sequentially last iteration or section updates the value of the variables in list
Example

main() {
    a = 1;
    #pragma omp parallel
    {
        #pragma omp for private(i), firstprivate(a), lastprivate(b)
        for (i=0; i<n; i++) {
            ...
            b = a + i;   /*-- a undefined, unless declared firstprivate --*/
            ...
        }
        a = b;           /*-- b undefined, unless declared lastprivate --*/
    } /*-- End of OpenMP parallel region --*/
}
Threadprivate Variables Private variables are private on a parallel region basis. Threadprivate variables are global variables that are private throughout the execution of the program. #pragma omp threadprivate(x) Initial data is undefined, unless copyin is used
copyin Clause copyin (list) data of the master thread is copied to the threadprivate copies
Example

What is the output of the following code?

#include <omp.h>

int a, b, i, tid;
float x;
#pragma omp threadprivate(a, x)

main () {
    printf("1st Parallel Region:\n");
    #pragma omp parallel private(b, tid)
    {
        tid = omp_get_thread_num();
        a = tid;
        b = tid;
        x = 1.1 * tid + 1.0;
        printf("Thread %d: a,b,x = %d %d %f\n", tid, a, b, x);
    } /* end of parallel section */

    printf("2nd Parallel Region:\n");
    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf("Thread %d: a,b,x = %d %d %f\n", tid, a, b, x);
    } /* end of parallel section */
}
Thread Synchronization

So far, implicit barriers at the end of parallel regions and all work-sharing constructs
The barrier at the end of a work-sharing construct can be removed with the nowait clause (the barrier at the end of a parallel region cannot):

#pragma omp parallel
{
    #pragma omp for nowait
    for ( ; ; ) { ... }
}
Explicit Synchronization Barrier

Can be explicitly inserted via barrier directive

/* some multi-threaded code */
#pragma omp barrier
/* remainder of multi-threaded code */
Explicit Synchronization Critical Section

Implements critical sections, similar to mutexes in threads.

#pragma omp critical [(name)]
{ ... }

A thread waits at the beginning of a critical region until no other thread is executing a critical region with the same name. All unnamed critical directives map to the same unspecified name.
Critical Sections Useful to avoid data races e.g., multiple threads updating the same variable May introduce a performance bottleneck
Critical Sections Example

int cnt = 0;
#pragma omp parallel
{
    #pragma omp for
    for (i=0; i<20; i++) {
        if (b[i] == 0) {
            #pragma omp critical
            {
                cnt++;
            }
        } /* endif */
        a[i] = b[i] * (i+1);
    } /* end for */
} /* omp end parallel */

Replace with atomic to define mini-critical section (with a single statement that updates a memory location)
Single Processor Region

Ideally suited for I/O or initialization

Example:

for (i=0; i<n; i++) {
    ...
    #pragma omp single
    {
        read_vector_from_file();
    }
    ...
}

Replace with master to ensure that the master thread is chosen
Some Advanced Features Conditional Parallelism Reduction clause Scheduling options
Conditional Parallelism Oftentimes, parallelism is only useful if the problem size is sufficiently big. For smaller sizes, overhead of parallelization exceeds benefit.
Conditional Parallelism

#pragma omp parallel if( expression )
#pragma omp for if( expression )
#pragma omp parallel for if( expression )

Execute in parallel if expression evaluates to true, otherwise execute sequentially.

Example:

for (i=0; i<n; i++)
    #pragma omp parallel for if(n-i > 100)
    for (j=i+1; j<n; j++)
        for (k=i+1; k<n; k++)
            a[j][k] = a[j][k] - a[i][k]*a[i][j] / a[j][j];
Reduction Clause

#pragma omp parallel for reduction(op:list)

op is one of +, *, -, &, ^, |, && or ||
list is a list of shared variables
A private copy of each list variable is created for each thread. At the end of the reduction, the reduction operator is applied to all private copies of the variable, and the result is written to the global shared variable
Reduction Example

#include <omp.h>

main() {
    int i, n = 100;
    float a[100], b[100], result = 0.0;

    for (i=0; i<n; i++) {
        a[i] = i * 1.0;
        b[i] = i * 2.0;
    }

    #pragma omp parallel for \
        default(shared) private(i) \
        reduction(+:result)
    for (i=0; i<n; i++)
        result = result + (a[i] * b[i]);

    printf("Final result = %f\n", result);
}
Load Balancing

With irregular workloads, care must be taken in distributing the work over the threads

Example: Multiplication of two matrices C = A x B, where the A matrix is upper-triangular (all elements below diagonal are 0).

[Diagram: upper-triangular matrix A, zeros below the diagonal]
Matrix Multiply Code

#pragma omp parallel for
for (i=0; i<n; i++)
    for (j=0; j<n; j++) {
        c[i][j] = 0.0;
        for (k=i; k<n; k++)
            c[i][j] += a[i][k]*b[k][j];
    }
The schedule Clause

schedule(static | dynamic | guided [,chunk])
schedule(runtime)

static [,chunk]
Distribute iterations in blocks of size "chunk" over the threads in a round-robin fashion
In absence of "chunk", each thread executes approx. N/P chunks for a loop of length N and P threads

Example, loop of length 8, 2 threads:

TID    No chunk    Chunk = 2
0      1-4         1-2, 5-6
1      5-8         3-4, 7-8
The schedule Clause (cont.) dynamic [,chunk] Fixed portions of work; size is controlled by the value of chunk When a thread finishes, it starts on the next portion of work guided [,chunk] Same dynamic behaviour as "dynamic", but size of the portion of work decreases exponentially runtime Iteration scheduling scheme is set at runtime through environment variable OMP_SCHEDULE
Exercise

Parallelize a loop with data dependencies

double V[];

for (iter=0; iter<numiter; iter++) {
    for (i=0; i<size-1; i++) {
        V[i] = f( V[i], V[i+1] );
    }
}
Exercise (cont.)

Incorrect Solution. Why?

for (iter=0; iter<numiter; iter++) {
    /* 3.1. PROCESS ELEMENTS */
    #pragma omp parallel for default(none) \
        shared(V,totalsize) private(i) schedule(static)
    for (i=0; i<totalsize-1; i++) {
        V[i] = f( V[i], V[i+1] );
    }
} /* 3.2. END ITERATIONS LOOP */
Exercise (cont.)

Correct Solution 1. How to avoid the (possibly expensive) array copy?

/* 3. ITERATIONS LOOP */
for (iter=0; iter<numiter; iter++) {

    /* 3.1. DUPLICATE THE FULL ARRAY IN PARALLEL */
    #pragma omp parallel for default(none) shared(V,oldV,totalsize) \
        private(i) schedule(static)
    for (i=0; i<totalsize; i++) {
        oldV[i] = V[i];
    }

    /* 3.2. INNER LOOP: PROCESS ELEMENTS IN PARALLEL */
    #pragma omp parallel for default(none) shared(V,oldV,totalsize) \
        private(i) schedule(static)
    for (i=0; i<totalsize-1; i++) {
        V[i] = f( V[i], oldV[i+1] );
    }
} /* 3.3. END ITERATIONS LOOP */
Exercise (cont.)

Correct Solution 2

/* 3. ITERATIONS LOOP */
for (iter=0; iter<numiter; iter++) {

    /* 3.1. PROCESS IN PARALLEL */
    #pragma omp parallel default(none) shared(V,size,nthreads,numiter) \
        private(iter,thread,limitL,limitR,border,i)
    {
        /* 3.1.1. GET NUMBER OF THREAD */
        thread = omp_get_thread_num();

        /* 3.1.2. COMPUTE LIMIT INDICES */
        limitL = thread*size;
        limitR = (thread+1)*size - 1;

        /* 3.1.3. COPY OTHER THREAD's NEIGHBOR ELEMENT */
        if (thread != nthreads-1)
            border = V[limitR+1];

        /* 3.1.4. SYNCHRONIZE BEFORE UPDATING LOCAL PART */
        #pragma omp barrier

        /* 3.1.5. COMPUTE LOCAL UPDATES */
        for (i=limitL; i<limitR; i++) {
            V[i] = f( V[i], V[i+1] );
        }

        /* 3.1.6. COMPUTE LAST ELEMENT (EXCEPT LAST THREAD) */
        if (thread != nthreads-1)
            V[limitR] = f( V[limitR], border );
    } /* 3.1.7. END PARALLEL REGION */
} /* 3.2. END ITERATIONS LOOP */