CSL 860: Modern Parallel Computation
Hello OpenMP

#pragma omp parallel
{   // I am now thread i of n
    switch (omp_get_thread_num()) {
    case 0: /* blah1 */ break;
    case 1: /* blah2 */ break;
    }
}
// Back to normal

Parallel Construct
Extremely simple to use and incredibly powerful
Fork-Join model
Every thread has its own execution context
Variables can be declared shared or private
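A minimal, compilable version of this sketch (the message text is illustrative; build with any OpenMP-enabled compiler, e.g. gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel                 // fork: a team of threads executes this block
    {
        int tid = omp_get_thread_num();
        int n   = omp_get_num_threads();
        printf("Hello from thread %d of %d\n", tid, n);
    }                                    // join: implicit barrier, master continues alone
    return 0;
}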
Execution Model
Encountering thread creates a team: itself (master) + zero or more additional threads
Applies to the structured block immediately following
Each thread executes a copy of the code in { }
But, also see: work-sharing constructs
There is an implicit barrier at the end of the block
Only the master continues beyond the barrier
May be nested
Sometimes disabled by default
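A small sketch of nesting, assuming nested parallelism is enabled first (it is often disabled by default):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_nested(1);                       // enable nested teams (often off by default)
    #pragma omp parallel num_threads(2)      // outer team
    {
        #pragma omp parallel num_threads(2)  // each outer thread forks an inner team
        printf("outer thread %d, inner thread %d\n",
               omp_get_ancestor_thread_num(1), omp_get_thread_num());
    }
    return 0;
}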
Memory Model
Notion of a temporary view of memory
Allows local caching
Need to flush memory
T1 writes -> T1 flushes -> T2 flushes -> T2 reads
Same order seen by all threads
Supports threadprivate memory
Variables declared before the parallel construct:
Shared by default
May be designated as private: n-1 copies of the original variable are created
May not be initialized by the system
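A sketch of threadprivate storage (the variable name is illustrative):

#include <stdio.h>
#include <omp.h>

int counter = 0;                      // file-scope variable...
#pragma omp threadprivate(counter)    // ...but each thread gets its own persistent copy

int main(void) {
    #pragma omp parallel
    {
        counter = omp_get_thread_num();   // no race: each thread writes its own copy
        printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
    }
    return 0;
}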
Shared Variables
Heap-allocated storage
Static data members
const-qualified variables (no mutable members)
Private:
Variables declared in a scope inside the construct
The loop variable of a for construct is private to the construct
Others are shared unless declared private
You can change the default
Arguments passed by reference inherit the attribute of the original
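A sketch of making the data-sharing attributes explicit (variable names are illustrative); default(none) forces every variable used inside to be listed:

#include <stdio.h>

int main(void) {
    int n = 8;
    int a[8];
    int scale = 3;
    #pragma omp parallel for default(none) shared(a, n) firstprivate(scale)
    for (int i = 0; i < n; i++)       // i is the loop variable: private automatically
        a[i] = scale * i;             // distinct indices: no race on the shared array
    for (int i = 0; i < n; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}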
Beware of Compiler Re-ordering
a = b = 0

thread 1                 thread 2
b = 1;                   a = 1;
flush(b); flush(a);      flush(a); flush(b);
if (a == 0) {            if (b == 0) {
    critical section         critical section
}                        }
Beware more of Compiler Re-ordering

// Parallel construct
{
    int b = initialsalary;
    printf("Initial Salary was %d\n", initialsalary);
    Bookkeeping();   // No read of b or write of initialsalary
    if (b < 10000) {
        raisesalary(500);
    }
}
Thread Control

Environment variable   Ways to modify value   Way to retrieve value   Initial value
OMP_NUM_THREADS *      omp_set_num_threads    omp_get_max_threads     Implementation defined
OMP_DYNAMIC            omp_set_dynamic        omp_get_dynamic         Implementation defined
OMP_NESTED             omp_set_nested         omp_get_nested          false
OMP_SCHEDULE *         -                      -                       Implementation defined

* Also see construct clauses: num_threads, schedule
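A sketch of the runtime-API counterparts to these environment variables:

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(4);                  // overrides OMP_NUM_THREADS for later regions
    omp_set_dynamic(0);                      // ask for exactly the requested team size
    printf("max threads: %d\n", omp_get_max_threads());
    #pragma omp parallel
    {
        #pragma omp single
        printf("team size: %d\n", omp_get_num_threads());
    }
    return 0;
}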
Parallel Construct

#pragma omp parallel \
    if(boolean) \
    private(var1, var2, var3) \
    firstprivate(var1, var2, var3) \
    default(shared | none) \
    shared(var1, var2) \
    copyin(var2) \
    reduction(operator: list) \
    num_threads(n)
{
}
Parallel Loop

#pragma omp parallel for
for (i = 0; i < N; ++i) {
    blah
}

The number of iterations must be known when the construct is encountered
Must be the same for each thread
The compiler puts a barrier at the end of the parallel for
But see nowait
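A sketch of nowait, assuming the two loops touch independent arrays so the first barrier can be dropped:

#include <stdio.h>
#include <omp.h>

int main(void) {
    enum { N = 1000 };
    static double a[N], b[N];
    #pragma omp parallel
    {
        #pragma omp for nowait       // no barrier: threads move straight to the next loop
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;
        #pragma omp for              // implicit barrier at the end of this loop
        for (int i = 0; i < N; i++)
            b[i] = 3.0 * i;
    }                                // and at the end of the parallel region
    printf("%f %f\n", a[N-1], b[N-1]);
    return 0;
}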
Parallel For

#pragma omp for \
    private(var1, var2, var3) \
    firstprivate(var1, var2, var3) \
    lastprivate(var1, var2) \
    reduction(operator: list) \
    ordered \
    schedule(kind[, chunk_size]) \
    nowait

Canonical for loop
No loop break
Schedule(kind[, chunk_size])
Divides iterations into contiguous sets, chunks
Chunks are assigned transparently to threads
static: iterations are divided among threads in a round-robin fashion
    When no chunk_size is specified, approximately equal chunks are made
dynamic: iterations are assigned to threads in request order
    When no chunk_size is specified, it defaults to 1
guided: like dynamic, but the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads
    If chunk_size = k, chunks have at least k iterations (except the last)
    When no chunk_size is specified, it defaults to 1
runtime: taken from the OMP_SCHEDULE environment variable
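A sketch comparing schedules on an imbalanced loop (the cost function is illustrative):

#include <stdio.h>
#include <omp.h>

/* Iteration i does O(i) work, so a plain static split over i is imbalanced. */
long cost(int i) {
    long s = 0;
    for (int k = 0; k < i; k++) s += k;
    return s;
}

int main(void) {
    enum { N = 10000 };
    long total = 0;
    #pragma omp parallel for schedule(dynamic, 16) reduction(+: total)
    for (int i = 0; i < N; i++)
        total += cost(i);            // chunks of 16 iterations handed out on request
    printf("total = %ld\n", total);
    return 0;
}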
Single

#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < n; i++)
        a[i] = f0(i);
    #pragma omp single
    x = f1(a);
    #pragma omp for
    for (int i = 0; i < n; i++)
        b[i] = x * f2(i);
}

Only one of the threads executes
Other threads wait for it, unless nowait is specified
Hidden complexity
Threads may be at different instructions
Sections

#pragma omp sections
{
    #pragma omp section
    {
        // do this
    }
    #pragma omp section
    {
        // do that
    }
}

The omp section directives must be closely nested in a sections construct, where no other work-sharing construct may appear.
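A runnable sketch of sections; the two tasks are illustrative:

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        #pragma omp sections         // each section is executed by one thread in the team
        {
            #pragma omp section
            printf("section A on thread %d\n", omp_get_thread_num());
            #pragma omp section
            printf("section B on thread %d\n", omp_get_thread_num());
        }                            // implicit barrier at the end of sections
    }
    return 0;
}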
Private Variables

#pragma omp parallel for private(size)
for (int i = 0; i < numthreads; i++) {
    int size  = numtasks / numthreads;
    int extra = numtasks - numthreads * size;
    if (i < extra) size++;
    dotask(i, size, numthreads);
}

void dotask(int start, int count, int stride) {
    // Each thread's instance has its own activation record
    for (int i = 0, t = start; i < count; i++, t += stride)
        doit(t);
}
Firstprivate and Lastprivate
The initial value of a private variable is unspecified
firstprivate initializes the copies with the original
    Once per thread (not once per iteration)
    The original exists before the construct
Only the original copy is retained after the construct
lastprivate forces sequential-like behavior
    The thread executing the sequentially last iteration (or last listed section) writes its value to the original copy
Firstprivate and Lastprivate

#pragma omp parallel for firstprivate(simple)
for (int i = 0; i < n; i++) {
    simple += a[f1(i, omp_get_thread_num())];
    f2(simple);
}

#pragma omp parallel for lastprivate(doneearly)
for (i = 0; (i < n) && !doneearly; i++) {
    doneearly = f0(i);
}
Other Synchronization Directives

#pragma omp master
{
}

Binds to the innermost enclosing parallel region
Only the master executes
No implied barrier
Master Directive

#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < 100; i++)
        a[i] = f0(i);
    #pragma omp master
    x = f1(a);
}

Only the master executes. No synchronization.
Critical Section

#pragma omp critical (accessbankbalance)
{
}

A single thread at a time
Applies to all threads
The name is optional; no name implies a global critical region
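A sketch of a named critical section guarding a shared balance (the names are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int balance = 0;
    #pragma omp parallel
    {
        // Only one thread at a time may execute any region with this name
        #pragma omp critical (accessbankbalance)
        balance += 100;              // one deposit per thread in the team
    }
    printf("balance = %d\n", balance);
    return 0;
}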
Barrier Directive

#pragma omp barrier

Stand-alone
Binds to the innermost enclosing parallel region
All threads in the team must execute it
    They will all wait for each other at this instruction
Dangerous:
    if (!ready)
        #pragma omp barrier
The same sequence of work-sharing constructs and barriers must be encountered by the entire team
Ordered Directive

#pragma omp ordered
{
}

Binds to the innermost enclosing loop
The structured block is executed in sequential iteration order
The loop must declare the ordered clause
Each iteration may encounter at most one ordered region
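A sketch: the loop body runs in parallel, but the prints come out in iteration order:

#include <stdio.h>
#include <omp.h>

int main(void) {
    enum { N = 16 };
    double r[N];
    #pragma omp parallel for ordered schedule(dynamic)
    for (int i = 0; i < N; i++) {
        r[i] = (double)i * i;        // this part runs out of order, in parallel
        #pragma omp ordered          // this block executes in sequential order of i
        printf("i = %d, r = %f\n", i, r[i]);
    }
    return 0;
}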
Flush Directive

#pragma omp flush (var1, var2)

Stand-alone, like barrier
Only directly affects the encountering thread
The list of vars ensures that any compiler re-ordering moves all flushes together
Atomic Directive

#pragma omp atomic
i++;

Light-weight critical section
Only for some expressions:
    x = expr (no mutual exclusion on the evaluation of expr)
    x++   ++x   x--   --x
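A sketch of a shared counter protected by atomic (the counter name is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    enum { N = 100000 };
    long hits = 0;
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        #pragma omp atomic           // the read-modify-write of hits is indivisible
        hits++;
    }
    printf("hits = %ld\n", hits);    // N with the atomic; typically less without it
    return 0;
}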
Reductions
Reductions are so common that OpenMP provides support for them
May add a reduction clause to the parallel for pragma
Specify the reduction operation and the reduction variable
OpenMP takes care of storing partial results in private variables and combining the partial results after the loop
reduction Clause
reduction (<op> : <variable>)

+     Sum
*     Product
&     Bitwise and
|     Bitwise or
^     Bitwise exclusive or
&&    Logical and
||    Logical or

Add to parallel for
OpenMP creates a loop to combine copies of the variable
The resulting loop may not be parallel
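A minimal sum reduction, assuming a simple dot product:

#include <stdio.h>
#include <omp.h>

int main(void) {
    enum { N = 1000 };
    double a[N], b[N], dot = 0.0;
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    // Each thread accumulates into its own private copy of dot;
    // the copies are combined with + after the loop.
    #pragma omp parallel for reduction(+: dot)
    for (int i = 0; i < N; i++)
        dot += a[i] * b[i];

    printf("dot = %f\n", dot);       // 2000.0
    return 0;
}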
Nesting Restrictions
A work-sharing region may not be closely nested inside a work-sharing, critical, ordered, or master region.
A barrier region may not be closely nested inside a work-sharing, critical, ordered, or master region.
A master region may not be closely nested inside a work-sharing region.
An ordered region may not be closely nested inside a critical region.
An ordered region must be closely nested inside a loop region (or parallel loop region) with an ordered clause.
A critical region may not be nested (closely or otherwise) inside a critical region with the same name.
    Note that this restriction is not sufficient to prevent deadlock.
EXAMPLES
OpenMP Matrix Multiply

#pragma omp parallel for
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
        c[i][j] = 0.0;
        for (int k = 0; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];
    }

a, b, c are shared
i, j, k are private
OpenMP Matrix Multiply: Triangular

#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < n; i++)
    for (int j = i; j < n; j++) {
        c[i][j] = 0.0;
        for (int k = i; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];
    }

This multiplies an upper-triangular matrix A with B
Unbalanced workload
The schedule clause improves this
OpenMP Jacobi

for some number of timesteps/iterations {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j]
                                + grid[i][j-1] + grid[i][j+1] );
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            grid[i][j] = temp[i][j];
}

This could be improved by using just one parallel region
The implicit barrier after each loop eliminates the race on grid
OpenMP Jacobi

for some number of timesteps/iterations {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j]
                                + grid[i][j-1] + grid[i][j+1] );
            #pragma omp barrier
            grid[i][j] = temp[i][j];
        }
}

Is the barrier sufficient?
What change to the code is needed?
Recall: a barrier is per-team