CSL 730: Parallel Programming OpenMP
Find the Error

    int sum2d(int data[n][n]) {
        int i, j;
        #pragma omp parallel for
        for (i = 0; i < n; i++) {
            int sum = 0;
            for (j = 0; j < n; j++)
                sum += data[i][j];
        }
        return sum;
    }
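The errors: j is declared outside the construct, so it is shared by default and every thread races on the inner loop index; sum is scoped inside the outer loop body, so each row's total is discarded and the final return sum refers to nothing. A minimal corrected sketch using a reduction (n is passed explicitly here as a C99 VLA parameter, an assumption since the slide leaves its scope unclear):

    int sum2d(int n, int data[n][n]) {
        int sum = 0;
        // Each thread accumulates into a private copy of sum;
        // the copies are combined with + at the end of the loop.
        // Declaring j inside the loop makes it private automatically.
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                sum += data[i][j];
        return sum;
    }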
Shared Memory Programming

High-level language:

    for i = 0 to N
        a[i] = f(b[i], c[i], d[i])

- Derive parallelism: generate threads and map them to processors
- Addresses for a, b, c, d accessible to all; also the code for f
- Map i to thread id
- Impact on cache coherence?
User-directed Shared Memory Programming

- A way to generate new threads of control
  - function per thread? work-sharing construct?
- Synchronize
- Specify shared variables
  - maybe, for an arbitrary group of threads
- Ways to map each thread to a processor?
  - may have more threads than processors
- Need high-level constructs for all these
Hello OpenMP

    #pragma omp parallel
    {
        // I am now thread i of n
        switch (omp_get_thread_num()) {
            case 0: blah1..
            case 1: blah2..
        }
    }
    // Back to normal

Parallel construct:
- Extremely simple to use and incredibly powerful
- Fork-join model
- Every thread has its own execution context
- Variables can be declared shared or private
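A complete, compilable version of the fragment above (a sketch; compile with, e.g., gcc -fopenmp hello.c):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel
        {
            // Each thread runs this block with its own id
            switch (omp_get_thread_num()) {
                case 0:  printf("hello from the master thread\n"); break;
                default: printf("hello from thread %d of %d\n",
                                omp_get_thread_num(), omp_get_num_threads());
            }
        }   // implicit barrier; only the master continues
        return 0;
    }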
Execution Model

- Encountering thread creates a team: itself (master) + zero or more additional threads
- Applies to the structured block immediately following
- Each thread executes the code in { } separately
  - but, also see: work-sharing constructs
- There is an implicit barrier at the end of the block
  - only the master continues beyond the barrier
- May be nested
  - sometimes disabled by default
Memory Model

- Notion of a temporary view of memory
  - allows local caching
  - requires a relaxed consistency model
- Supports threadprivate memory (global scope)
- Variables declared before a parallel construct:
  - shared by default
  - may be designated as private: n-1 copies of the original variable are created
    - may not be initialized by the system
Variable Sharing among Threads

Shared:
- heap-allocated storage
- static data members
- const variables (with no mutable members)

Private:
- auto variables declared in a scope inside the construct
- the loop variable in a for construct (private to the construct)

Others are shared unless declared private:
- you can change the default
- arguments passed by reference inherit from the original
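A small sketch illustrating these defaults (demo and the variable names are made up for illustration):

    #include <stdio.h>
    #include <omp.h>

    int global = 0;                  // static storage: shared

    void demo(int n) {
        int before = 1;              // declared before the construct: shared by default
        #pragma omp parallel
        {
            int inside = 0;          // auto, declared inside: private to each thread
            #pragma omp for
            for (int i = 0; i < n; i++)   // loop variable: private to the construct
                inside += before;
            printf("thread %d: inside = %d\n", omp_get_thread_num(), inside);
        }
    }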
Relaxed Consistency

- Unsynchronized access:
  - if two threads write to the same shared variable, the result is undefined
  - if one thread reads and another writes, the read value is undefined
- Memory atom size is implementation dependent
- flush(x, y, z..) enforces consistency:
  - T1 writes -> T1 flushes -> T2 flushes -> T2 reads
  - same order seen by all threads
- The specs say:
  - If the intersection of the flush-sets of two flushes performed by two different threads is nonempty, then the two flushes must be completed as if in some sequential order, seen by all threads.
  - If the intersection of the flush-sets of two flushes performed by one thread is nonempty, then the two flushes must appear to be completed in that thread's program order.
Beware of Compiler Re-ordering

    a = b = 0

    thread 1                thread 2
    b = 1                   a = 1
    if (a == 0) {           if (b == 0) {
        critical section        critical section
    }                       }

Both threads can enter the critical section: each may read the other's flag before the other's write becomes visible.
Beware of Compiler Re-ordering

    a = b = 0

    thread 1                thread 2
    b = 1                   a = 1
    flush(b);               flush(a);
    flush(a);               flush(b);
    if (a == 0) {           if (b == 0) {
        critical section        critical section
    }                       }

Still broken: the two separate flushes have disjoint flush-sets, so the compiler may re-order them.
Beware of Compiler Re-ordering

    a = b = 0

    thread 1                thread 2
    b = 1                   a = 1
    flush(b,a);             flush(a,b);
    if (a == 0) {           if (b == 0) {
        critical section        critical section
    }                       }

A single flush of both variables cannot be split or re-ordered.
Thread Control

    Environment variable | Ways to modify value | Way to retrieve value | Initial value
    OMP_NUM_THREADS *    | omp_set_num_threads  | omp_get_max_threads   | implementation defined
    OMP_DYNAMIC          | omp_set_dynamic      | omp_get_dynamic       | implementation defined
    OMP_NESTED           | omp_set_nested       | omp_get_nested        | false
    OMP_SCHEDULE *       |                      |                       | implementation defined

* Also see construct clauses: num_threads, schedule
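A sketch of how these levels interact: the environment variable sets the initial value, the API call overrides it, and a construct clause overrides both for one region.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_set_num_threads(4);              // overrides OMP_NUM_THREADS from here on
        printf("max threads: %d\n", omp_get_max_threads());   // prints 4

        #pragma omp parallel num_threads(2)  // clause wins for this region only
        {
            #pragma omp single
            printf("team size: %d\n", omp_get_num_threads()); // prints 2
        }
        return 0;
    }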
Parallel Construct

    #pragma omp parallel \
        if(boolean) \
        private(var1, var2, var3) \
        firstprivate(var1, var2, var3) \
        default(private | shared | none) \
        shared(var1, var2) \
        copyin(var1, var2) \
        reduction(operator : list) \
        num_threads(n)
    {
        ...
    }

Restrictions:
- Cannot branch in or out
- No side effects from clauses: must not depend on any ordering of the evaluations
- Up to one if clause
- Up to one num_threads clause; num_threads must be a positive integer
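A sketch combining several clauses: if() serializes the region when the problem is small, and default(none) forces every sharing decision to be stated explicitly (scale and its arguments are made-up names):

    #include <omp.h>

    void scale(float *a, int n) {
        #pragma omp parallel if(n > 1000) num_threads(4) default(none) shared(a, n)
        {
            int tid  = omp_get_thread_num();
            int nthr = omp_get_num_threads();    // 1 if the if() clause was false
            for (int i = tid; i < n; i += nthr)  // manual cyclic partition of the work
                a[i] *= 2.0f;
        }
    }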
What's wrong?

    int tid, size;
    int numprocs = omp_get_num_procs();
    #pragma omp parallel num_threads(numprocs)
    {
        size = getproblemsize() / numprocs; // assume divisible
        tid = omp_get_thread_num();
        dotask(tid * size, size);
    }

    dotask(int start, int count)
    {
        // Each thread's instance has its own activation record
        for (int i = 0, t = start; i < count; i++, t += 1)
            doit(t);
    }
Declare locally (private)

    int size;
    int numprocs = omp_get_num_procs();
    #pragma omp parallel num_threads(numprocs)
    {
        size = getproblemsize() / numprocs;
        int tid = omp_get_thread_num();   // private: declared inside the construct
        dotask(tid * size, size);
    }

    dotask(int start, int count)
    {
        // Each thread's instance has its own activation record
        for (int i = 0, t = start; i < count; i++, t += 1)
            doit(t);
    }
Private clause

    int tid, size;
    int numprocs = omp_get_num_procs();
    #pragma omp parallel num_threads(numprocs) private(tid)
    {
        size = getproblemsize() / numprocs;
        tid = omp_get_thread_num();
        dotask(tid * size, size);
    }

    dotask(int start, int count)
    {
        // Each thread's instance has its own activation record
        for (int i = 0, t = start; i < count; i++, t += 1)
            doit(t);
    }
Parallel Loop

    #pragma omp parallel for
    for (i = 0; i < N; ++i) {
        blah..
    }

- Number of iterations must be known when the construct is encountered
  - must be the same for each encountering thread
- Compiler puts a barrier at the end of parallel for
  - but see nowait
Parallel For Construct

    #pragma omp for \
        private(var1, var2, var3) \
        firstprivate(var1, var2, var3) \
        lastprivate(var1, var2) \
        reduction(operator : list) \
        ordered \
        schedule(kind[, chunk_size]) \
        nowait

- Canonical for loop: no break
- Restrictions:
  - same loop control expression for all threads in the team
  - at most one schedule, nowait, or ordered clause
  - chunk_size must be a loop/construct-invariant positive integer
  - ordered clause required if any ordered region is inside
Firstprivate and Lastprivate

- Initial value of a private variable is unspecified
- firstprivate initializes the copies with the original
  - once per thread (not once per iteration)
  - the original exists before the construct
- The original copy lives on after the construct
- lastprivate forces sequential-like behavior:
  - the thread executing the sequentially last iteration (or last listed section) writes to the original copy
Firstprivate and Lastprivate

    #pragma omp parallel for firstprivate(simple)
    for (int i = 0; i < n; i++) {
        simple += a[f1(i, omp_get_thread_num())];
        f2(simple);
    }

    #pragma omp parallel for lastprivate(doneearly)
    for (i = 0; i < n; i++) {
        doneearly = f0(i);
    }
Private clause

Remember this code?

    int tid, size;
    int numprocs = omp_get_num_procs();
    #pragma omp parallel num_threads(numprocs) private(tid)
    {
        size = getproblemsize() / numprocs;
        tid = omp_get_thread_num();
        dotask(tid * size, size);
    }

    dotask(int start, int count)
    {
        // Each thread's instance has its own activation record
        for (int i = 0, t = start; i < count; i++, t += 1)
            doit(t);
    }
Work Sharing for

    #pragma omp parallel for
    for (int i = 0; i < problemsize; i++)
        doit(i);

Works even if the number of tasks is not divisible by the number of threads
schedule(kind[, chunk_size])

- Divides iterations into contiguous sets, chunks
  - chunks are assigned transparently to threads
- static: chunks are assigned in a round-robin fashion
  - default chunk_size is roughly loop length / num_threads
- dynamic: chunks are assigned to threads as requested
  - default chunk_size is 1
- guided: dynamic, with chunk size proportional to the number of unassigned iterations divided by num_threads
  - chunk size is at least chunk_size iterations (except the last)
  - default chunk_size is 1
- runtime: taken directly from the environment variable
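A sketch contrasting the two most common kinds (uniform_work and uneven_work are hypothetical): static fits loops whose iterations cost about the same; dynamic fits irregular iterations, at the price of some scheduling overhead per chunk.

    #include <omp.h>

    void demo_schedules(int n) {
        extern void uniform_work(int);   // hypothetical: same cost for every i
        extern void uneven_work(int);    // hypothetical: cost varies with i

        #pragma omp parallel for schedule(static)     // contiguous chunks, fixed up front
        for (int i = 0; i < n; i++) uniform_work(i);

        #pragma omp parallel for schedule(dynamic, 4) // threads grab 4 iterations at a time
        for (int i = 0; i < n; i++) uneven_work(i);
    }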
Reductions

- Reductions are common: scalar = f(v1 .. vn)
- Specify the reduction operation and variable
- OpenMP combines results from the loop
  - stores partial results in private variables
reduction Clause

    reduction(<op> : <variable>)

    +    sum
    *    product
    &    bitwise and
    |    bitwise or
    ^    bitwise exclusive or
    &&   logical and
    ||   logical or

- Add to parallel for
- OpenMP creates a loop to combine copies of the variable
  - the resulting loop may not be parallel
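A minimal sketch: a dot product with reduction(+:sum). Each thread accumulates into its own private copy of sum, and OpenMP combines the copies at the end of the loop.

    #include <omp.h>

    float dot(const float *a, const float *b, int n) {
        float sum = 0.0f;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;   // all private copies have been summed into the original
    }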
Single Construct

    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++)
            a[i] = f0(i);

        #pragma omp single
        x = f1(a);

        #pragma omp for
        for (int i = 0; i < n; i++)
            b[i] = x * f2(i);
    }

- Only one of the threads executes it
- Other threads wait for it, unless nowait is specified
- Hidden complexity: threads may not all hit the single
Sections Construct

    #pragma omp sections
    {
        #pragma omp section
        {
            // do this
        }
        #pragma omp section
        {
            // do that
        }
        // ...
    }

The omp section pragma must be closely nested in a sections construct, where no other work-sharing construct may appear.
Other Synchronization Directives

    #pragma omp master
    {
        ...
    }

- Binds to the innermost enclosing parallel region
- Only the master executes it
- No implied barrier
Master Directive

    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < 100; i++)
            a[i] = f0(i);

        #pragma omp master
        x = f1(a);
    }

Only the master executes it. No synchronization.
Critical Section

    #pragma omp critical (accessbankbalance)
    {
        ...
    }

- A single thread at a time through all regions of the same name
- Applies to all threads
- The name is optional
  - anonymous = global critical region
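A sketch matching the slide's name: every deposit in the program passes through the named region, one thread at a time (balance and deposit are made-up names):

    #include <omp.h>

    double balance = 0.0;   // shared account state

    void deposit(double amount) {
        // All critical regions named accessbankbalance exclude each other,
        // across every thread and every parallel region in the program.
        #pragma omp critical (accessbankbalance)
        {
            balance += amount;
        }
    }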
Barrier Directive

    #pragma omp barrier

- Stand-alone
- Binds to the innermost parallel region
- All threads in the team must execute it
  - they will all wait for each other at this instruction
- Dangerous:

    if (!ready)
        #pragma omp barrier

- Same sequence of work-sharing constructs and barriers for the entire team
Ordered Directive

    #pragma omp ordered
    {
        ...
    }

- Binds to the innermost enclosing loop
- The structured block is executed in sequential loop order
- The loop must declare the ordered clause
- Each iteration must execute at most one ordered region
Flush Directive

    #pragma omp flush (var1, var2)

- Stand-alone, like barrier
- Only directly affects the encountering thread
- The list of vars ensures that any compiler re-ordering moves all flushes together
- Implicit at: barrier, atomic, critical, locks
Atomic Directive

    #pragma omp atomic
    i++;

- Lightweight critical section
- Only for some expressions:
  - x = expr (no mutual exclusion on the evaluation of expr)
  - x++, ++x, x--, --x
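A sketch where atomic is the natural fit: a histogram, where different iterations may increment the same bin (histogram and bin_of are made-up names):

    #include <omp.h>

    void histogram(const int *bin_of, int n, int *count) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            // Protects only this single update; far cheaper than a
            // critical region when updates rarely collide.
            #pragma omp atomic
            count[bin_of[i]]++;
        }
    }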
Helper Functions: General

    void   omp_set_dynamic(int);
    int    omp_get_dynamic();
    void   omp_set_nested(int);
    int    omp_get_nested();
    int    omp_get_num_procs();
    int    omp_get_num_threads();
    int    omp_get_thread_num();
    int    omp_get_ancestor_thread_num();
    double omp_get_wtime();
Helper Functions: Mutex

    void omp_init_lock(omp_lock_t *);
    void omp_destroy_lock(omp_lock_t *);
    void omp_set_lock(omp_lock_t *);
    void omp_unset_lock(omp_lock_t *);
    int  omp_test_lock(omp_lock_t *);

Nested lock versions, e.g.: omp_set_nest_lock(omp_nest_lock_t *);
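A sketch of the lock life cycle: init once, set/unset around the shared update, destroy when done (count_matches is a made-up name; a reduction would of course be the better tool for this particular loop):

    #include <omp.h>

    void count_matches(const int *a, int n, int key, int *hits) {
        omp_lock_t lock;
        omp_init_lock(&lock);            // must precede any set/unset
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            if (a[i] == key) {
                omp_set_lock(&lock);     // blocks until acquired
                (*hits)++;
                omp_unset_lock(&lock);
            }
        omp_destroy_lock(&lock);
    }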
Nesting Restrictions

- A critical region may never be nested inside a critical region with the same name
  - not sufficient to prevent deadlock
- Not allowed without an intervening parallel region:
  - a work-sharing region or barrier inside a work-sharing, critical, ordered, or master region
  - a master region inside a work-sharing region
  - an ordered region inside a critical region
EXAMPLES
Firstprivate and Lastprivate

    #pragma omp parallel for firstprivate(simple)
    for (int i = 0; i < n; i++) {
        simple += a[f1(i, omp_get_thread_num())];
        f2(simple);
    }

    #pragma omp parallel for lastprivate(doneearly)
    for (i = 0; i < n; i++) {
        doneearly = f0(i);
    }
Ordered Construct

    int i;
    #pragma omp for ordered
    for (i = 0; i < n; i++) {
        if (isgroupa(i)) {
            #pragma omp ordered
            doit(i);
        } else {
            #pragma omp ordered
            doit(partner(i));
        }
    }

Legal: each iteration executes exactly one ordered region.
Wrong Use of Multiple Orders

    int i;
    #pragma omp for ordered
    for (i = 0; i < n; i++) {
        #pragma omp ordered
        doit(i);
        #pragma omp ordered     // illegal: second ordered region in the same iteration
        doit(partner(i));
    }
OpenMP Matrix Multiply

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            c[i][j] = 0.0;
            for (int k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
        }

- a, b, c are shared
- i, j, k are private
OpenMP Matrix Multiply

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        #pragma omp parallel for
        for (int j = 0; j < n; j++) {
            c[i][j] = 0.0;
            for (int k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
        }

The inner construct creates its own team only if nested parallelism is enabled.
OpenMP Matrix Multiply

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        #pragma omp parallel for
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            #pragma omp parallel for reduction(+:sum)
            for (int k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
OpenMP Matrix Multiply: Triangular

    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < n; i++)
        for (int j = i; j < n; j++) {
            c[i][j] = 0.0;
            for (int k = i; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
        }

- This multiplies upper-triangular matrix A with B
- Unbalanced workload: later rows have less work
- schedule(dynamic, 1) improves this
OpenMP Jacobi

    for some number of timesteps/iterations {
        #pragma omp parallel for
        for (int i = 1; i < n-1; i++)
            for (int j = 1; j < n-1; j++)
                temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                     grid[i][j-1] + grid[i][j+1]);

        #pragma omp parallel for
        for (int i = 1; i < n-1; i++)
            for (int j = 1; j < n-1; j++)
                grid[i][j] = temp[i][j];
    }

- This could be improved by using just one parallel region (see the sketch below)
- Implicit barrier after the loops eliminates the race on grid
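A sketch of the suggested improvement: one parallel region enclosing both work-sharing loops, so the thread team is created once per timestep instead of twice. The implicit barrier at the end of each omp for still prevents the race on grid (jacobi_step is a made-up name; the boundary rows and columns are assumed fixed):

    #include <omp.h>

    void jacobi_step(int n, double grid[n][n], double temp[n][n]) {
        #pragma omp parallel
        {
            #pragma omp for
            for (int i = 1; i < n-1; i++)
                for (int j = 1; j < n-1; j++)
                    temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                         grid[i][j-1] + grid[i][j+1]);
            // implicit barrier: all of temp is written before anyone copies back
            #pragma omp for
            for (int i = 1; i < n-1; i++)
                for (int j = 1; j < n-1; j++)
                    grid[i][j] = temp[i][j];
        }
    }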
Find the Error

Assume: variables declared, locks initialized.

    #pragma omp parallel shared(a, b, nthreads, locka, lockb) private(tid)
    {
        #pragma omp sections nowait
        {
            #pragma omp section
            {
                omp_set_lock(&locka);
                for (i = 0; i < n; i++) a[i] = i * DELTA;
                omp_set_lock(&lockb);
                for (i = 0; i < n; i++) b[i] += a[i];
                omp_unset_lock(&lockb);
                omp_unset_lock(&locka);
            }
            #pragma omp section
            {
                omp_set_lock(&lockb);
                for (i = 0; i < n; i++) b[i] = i * PI;
                omp_set_lock(&locka);
                for (i = 0; i < n; i++) a[i] += b[i];
                omp_unset_lock(&locka);
                omp_unset_lock(&lockb);
            }
        } /* end of sections */
    }
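The error is a classic deadlock: the first section holds locka while waiting for lockb, and the second holds lockb while waiting for locka. One possible fix (a sketch reusing the slide's declarations): both sections acquire the locks in the same global order, so neither can hold one lock while waiting for the other.

    #pragma omp sections nowait
    {
        #pragma omp section
        {
            omp_set_lock(&locka);              // always locka first...
            omp_set_lock(&lockb);              // ...then lockb
            for (i = 0; i < n; i++) a[i]  = i * DELTA;
            for (i = 0; i < n; i++) b[i] += a[i];
            omp_unset_lock(&lockb);
            omp_unset_lock(&locka);
        }
        #pragma omp section
        {
            omp_set_lock(&locka);              // same order as the other section
            omp_set_lock(&lockb);
            for (i = 0; i < n; i++) b[i]  = i * PI;
            for (i = 0; i < n; i++) a[i] += b[i];
            omp_unset_lock(&lockb);
            omp_unset_lock(&locka);
        }
    }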
Find the Error

    void worksum(float *x, float *y, int *index, int n) {
        int i;
        #pragma omp parallel for shared(x, y, index, n) nowait
        for (i = 0; i < n; i++) {
            #pragma omp atomic
            x[index[i]] += work1(i);
            y[i] += work2(i);
        }
        int work0 = x[0];
    }
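Two problems: nowait is not a valid clause on the combined parallel for construct, and even conceptually, removing the barrier would let the final read of x[0] race with threads still updating x. A sketch of a safe version (work1 and work2 are assumed declared as on the slide):

    extern float work1(int), work2(int);     // assumed, as on the slide

    void worksum(float *x, float *y, int *index, int n) {
        int i;
        #pragma omp parallel for shared(x, y, index, n)
        for (i = 0; i < n; i++) {
            #pragma omp atomic               // index[i] values may repeat across threads
            x[index[i]] += work1(i);
            y[i] += work2(i);                // each i is touched by exactly one thread
        }                                    // implicit barrier
        float work0 = x[0];                  // now safe, and read as float, not int
    }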
Efficiency Issues

- Minimize synchronization
  - avoid BARRIER, CRITICAL, ORDERED, and locks
  - use NOWAIT
  - use named CRITICAL sections for fine-grained locking
  - use MASTER (instead of SINGLE)
- Parallelize at the highest level possible
  - such as outer FOR loops; keep parallel regions large
- FLUSH is expensive
- LASTPRIVATE has synchronization overhead
- Thread-safe malloc/free are expensive
- Reduce false sharing
  - design of data structures
  - use PRIVATE
Common SMP Errors

- Synchronization
  - race condition: depends on timing
  - deadlock: waiting for a non-existent condition
  - livelock: continuously adjusting, but task progress stalled
- Try to:
  - avoid nested locks
  - release locks religiously
  - avoid while(true) (especially during testing)
- Be careful with:
  - non-thread-safe libraries
  - concurrent access to shared data
  - IO inside parallel regions
  - differing views of shared memory (FLUSH)
  - NOWAIT