CSL 730: Parallel Programming. OpenMP


Find the Error

int sum2d(int data[n][n]) {
  int i, j;
  #pragma omp parallel for
  for (i = 0; i < n; i++) {
    int sum = 0;
    for (j = 0; j < n; j++) {
      sum += data[i][j];
    }
  }
  return sum;
}
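The problem: each iteration declares its own sum, nothing is accumulated across the loop, and the final return refers to a variable that is out of scope. A minimal corrected sketch (assuming n is known as in the slide and the elements are int) accumulates into one shared sum with a reduction:

int sum2d(int data[n][n]) {
  int sum = 0;
  #pragma omp parallel for reduction(+:sum)   // each thread gets a private sum, combined at the end
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      sum += data[i][j];
  return sum;
}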

Shared Memory Programming
High-level language:
  for i = 0 to N
    a[i] = f(b[i], c[i], d[i])
Derive parallelism: generate threads and map them to processors
Addresses of a, b, c, d are accessible to all threads, as is the code for f
Map i to a thread id
Impact on cache coherence?

User-directed Shared Memory Programming
A way to generate new threads of control
  A function per thread?
  A work-sharing construct?
Synchronize
  Specify shared variables
  Maybe, for an arbitrary group of threads
Ways to map each thread to a processor?
  May have more threads than processors
Need high-level constructs for all of these

Hello OpenMP
#pragma omp parallel
{
  // I am now thread i of n
  switch (omp_get_thread_num()) {
    case 0: blah1..
    case 1: blah2..
  }
}
// Back to normal
Parallel construct: extremely simple to use and incredibly powerful
Fork-join model
Every thread has its own execution context
Variables can be declared shared or private
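To make the fork-join model concrete, a minimal, self-contained version of this slide (file name and message are illustrative) could look like:

// hello_omp.c -- compile with, e.g., gcc -fopenmp hello_omp.c
#include <omp.h>
#include <stdio.h>

int main(void) {
  #pragma omp parallel                    // fork: a team of threads executes the block
  {
    printf("Hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }                                       // join: implicit barrier; master continues alone
  return 0;
}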

Execution Model
The encountering thread creates a team: itself (master) + zero or more additional threads
Applies to the structured block immediately following
Each thread separately executes the code in { }
  But, also see: work-sharing constructs
There is an implicit barrier at the end of the block
  Only the master continues beyond the barrier
May be nested
  Sometimes disabled by default

Memory Model
Notion of a temporary view of memory
  Allows local caching
  Requires a relaxed consistency model
Supports threadprivate memory (global scope)
Variables declared before the parallel construct:
  Shared by default
  May be designated as private: n-1 copies of the original variable are created
  May not be initialized by the system

Variable Sharing among Threads
Shared:
  Heap-allocated storage
  Static data members
  const variables (with no mutable members)
Private:
  auto variables declared in a scope inside the construct
  The loop variable of a for construct, private to that construct
Others are shared unless declared private; you can change the default
Arguments passed by reference inherit the attribute from the original

Relaxed Consistency
Unsynchronized access:
  If two threads write to the same shared variable, the result is undefined
  If one thread reads and another writes, the value read is undefined
  T1 writes -> T1 flushes -> T2 flushes -> T2 reads: the same order is seen by all threads
Memory atom size is implementation dependent
flush(x, y, z, ...) enforces consistency. The specs say:
  If the intersection of the flush-sets of two flushes performed by two different threads is nonempty, then the two flushes must be completed as if in some sequential order, seen by all threads.
  If the intersection of the flush-sets of two flushes performed by one thread is nonempty, then the two flushes must appear to be completed in that thread's program order.

Beware of Compiler Re-ordering
a = b = 0
thread 1              thread 2
b = 1                 a = 1
if (a == 0) {         if (b == 0) {
  critical section      critical section
}                     }
Without flushes, both threads may see stale values and enter their critical sections.

Beware of Compiler Re-ordering
a = b = 0
thread 1              thread 2
b = 1                 a = 1
flush(b); flush(a);   flush(a); flush(b);
if (a == 0) {         if (b == 0) {
  critical section      critical section
}                     }
Still broken: the two flushes have disjoint flush-sets, so the compiler may reorder them.

Beware of Compiler Re-ordering
a = b = 0
thread 1              thread 2
b = 1                 a = 1
flush(b,a);           flush(a,b);
if (a == 0) {         if (b == 0) {
  critical section      critical section
}                     }
Flushing both variables in a single flush prevents that reordering.

Thread Control
Environment variable   Ways to modify value     Way to retrieve value    Initial value
OMP_NUM_THREADS *      omp_set_num_threads      omp_get_max_threads      Implementation defined
OMP_DYNAMIC            omp_set_dynamic          omp_get_dynamic          Implementation defined
OMP_NESTED             omp_set_nested           omp_get_nested           false
OMP_SCHEDULE *         -                        -                        Implementation defined
* Also see the construct clauses: num_threads, schedule
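A small sketch of the three ways the team size can be controlled (environment variable, library call, and the num_threads clause); the values 4 and 2 are arbitrary:

#include <omp.h>
#include <stdio.h>

int main(void) {
  // Either export OMP_NUM_THREADS=4 before running, or set it from code:
  omp_set_num_threads(4);                  // overrides OMP_NUM_THREADS
  printf("max threads: %d\n", omp_get_max_threads());

  #pragma omp parallel num_threads(2)      // the clause overrides both
  {
    #pragma omp single
    printf("team size: %d\n", omp_get_num_threads());
  }
  return 0;
}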

Parallel Construct
#pragma omp parallel \
  if(boolean) \
  private(var1, var2, var3) \
  firstprivate(var1, var2, var3) \
  default(private | shared | none) \
  shared(var1, var2) \
  copyin(var1, var2) \
  reduction(operator : list) \
  num_threads(n)
{
  ...
}
Restrictions:
  Cannot branch in or out
  No side effects from a clause: must not depend on any ordering of the evaluations
  Up to one if clause
  Up to one num_threads clause; num_threads must be a positive integer

What's wrong?
int tid, size;
int numprocs = omp_get_num_procs();
#pragma omp parallel num_threads(numprocs)
{
  size = getproblemsize()/numprocs;   // assume divisible
  tid = omp_get_thread_num();
  dotask(tid*size, size);
}

dotask(int start, int count) {
  // Each thread's instance has its own activation record
  for (int i = 0, t = start; i < count; i++, t += 1)
    doit(t);
}

Declare locally (private)
int size;
int numprocs = omp_get_num_procs();
#pragma omp parallel num_threads(numprocs)
{
  size = getproblemsize()/numprocs;
  int tid = omp_get_thread_num();
  dotask(tid*size, size);
}

dotask(int start, int count) {
  // Each thread's instance has its own activation record
  for (int i = 0, t = start; i < count; i++, t += 1)
    doit(t);
}

Private clause
int tid, size;
int numprocs = omp_get_num_procs();
#pragma omp parallel num_threads(numprocs) private(tid)
{
  size = getproblemsize()/numprocs;
  tid = omp_get_thread_num();
  dotask(tid*size, size);
}

dotask(int start, int count) {
  // Each thread's instance has its own activation record
  for (int i = 0, t = start; i < count; i++, t += stride)
    doit(t);
}

Parallel Loop
#pragma omp parallel for
for (i = 0; i < N; ++i) {
  blah
}
The number of iterations must be known when the construct is encountered
  It must be the same for each encountering thread
The compiler puts a barrier at the end of the parallel for
  But see nowait

Parallel For Construct
#pragma omp for \
  private(var1, var2, var3) \
  firstprivate(var1, var2, var3) \
  lastprivate(var1, var2) \
  reduction(operator : list) \
  ordered \
  schedule(kind[, chunk_size]) \
  nowait
Canonical for loop: no break
Restrictions:
  The same loop control expression for all threads in the team
  At most one schedule, nowait, or ordered clause
  chunk_size must be a loop/construct-invariant positive integer
  The ordered clause is required if any ordered region appears inside

Firstprivate and Lastprivate
The initial value of a private variable is unspecified
firstprivate initializes the copies with the original
  Once per thread (not once per iteration)
  The original exists before the construct
The original copy lives on after the construct
lastprivate forces sequential-like behavior
  The thread executing the sequentially last iteration (or the last listed section) writes to the original copy

Firstprivate and Lastprivate
#pragma omp parallel for firstprivate(simple)
for (int i = 0; i < n; i++) {
  simple += a[f1(i, omp_get_thread_num())];
  f2(simple);
}

#pragma omp parallel for lastprivate(doneearly)
for (i = 0; i < n; i++) {
  doneearly = f0(i);
}

Private clause - remember this code?
int tid, size;
int numprocs = omp_get_num_procs();
#pragma omp parallel num_threads(numprocs) private(tid)
{
  size = getproblemsize()/numprocs;
  tid = omp_get_thread_num();
  dotask(tid*size, size);
}

dotask(int start, int count) {
  // Each thread's instance has its own activation record
  for (int i = 0, t = start; i < count; i++, t += stride)
    doit(t);
}

Work Sharing for
#pragma omp parallel for
for (int i = 0; i < problemsize; i++)
  doit(i);
Works even if the number of tasks is not divisible by the number of threads

schedule(kind[, chunk_size])
Divides the iterations into contiguous sets, chunks
  Chunks are assigned transparently to threads
static: chunks are assigned in a round-robin fashion
  The default chunk_size is roughly load/num_threads
dynamic: chunks are assigned to threads as they request them
  The default chunk_size is 1
guided: dynamic, with chunk size proportional to the number of unassigned iterations divided by num_threads
  The chunk size is at least chunk_size iterations (except the last)
  The default chunk_size is 1
runtime: taken directly from the OMP_SCHEDULE environment variable
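A hedged sketch of how the clause is used; expensive_work and n are illustrative, with the per-iteration cost assumed to grow with i so a plain static schedule would load the last thread most heavily:

// Chunks of 4 iterations are handed out on demand, balancing uneven work.
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < n; i++)
  expensive_work(i);          // hypothetical helper, cost grows with i

// schedule(runtime) defers the choice to the environment, e.g.
//   export OMP_SCHEDULE="guided,8"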

Reductions
Reductions are common: scalar = f(v1 .. vn)
Specify the reduction operation and the variable
OpenMP combines results from the loop
  Partial results are stored in private variables

reduction Clause
reduction(<op> : <variable>)
  +    Sum
  *    Product
  &    Bitwise and
  |    Bitwise or
  ^    Bitwise exclusive or
  &&   Logical and
  ||   Logical or
Add to parallel for
OpenMP creates a loop to combine copies of the variable
The resulting loop may not be parallel
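A minimal sketch of the clause on a dot product (the function name is illustrative): each thread accumulates into its own private copy of sum, and OpenMP combines the copies with + when the loop ends.

double dot(const double *a, const double *b, int n) {
  double sum = 0.0;
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}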

Single Construct
#pragma omp parallel
{
  #pragma omp for
  for (int i = 0; i < n; i++)
    a[i] = f0(i);
  #pragma omp single
  x = f1(a);
  #pragma omp for
  for (int i = 0; i < n; i++)
    b[i] = x * f2(i);
}
Only one of the threads executes the single block
The other threads wait for it, unless nowait is specified
Hidden complexity: threads may not all hit the single

Sections Construct
#pragma omp sections
{
  #pragma omp section
  {
    // do this
  }
  #pragma omp section
  {
    // do that
  }
}
// An omp section pragma must be closely nested in a sections construct, where no other work-sharing construct may appear.

Other Synchronization Directives
#pragma omp master
{ ... }
Binds to the innermost enclosing parallel region
Only the master executes the block
No implied barrier

Master Directive
#pragma omp parallel
{
  #pragma omp for
  for (int i = 0; i < 100; i++)
    a[i] = f0(i);
  #pragma omp master
  x = f1(a);
}
Only the master executes f1. No synchronization.

Critical Section
#pragma omp critical (accessbankbalance)
{ ... }
A single thread at a time through all regions of the same name
Applies to all threads
The name is optional
Anonymous = one global critical region
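A hedged sketch of named critical regions: updates to the balance and to a log are serialized independently of each other; balance, compute_delta, and log_entry are illustrative names.

#pragma omp parallel for
for (int i = 0; i < n; i++) {
  double d = compute_delta(i);            // hypothetical helper
  #pragma omp critical (accessbankbalance)
  balance += d;
  #pragma omp critical (accesslog)
  log_entry(i, d);                        // hypothetical helper
}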

Barrier Directive
#pragma omp barrier
Stand-alone directive
Binds to the innermost parallel region
All threads in the team must execute it; they will all wait for each other at this instruction
Dangerous:
  if (!ready)
    #pragma omp barrier
The same sequence of work-sharing constructs and barriers is required for the entire team

Ordered Directive
#pragma omp ordered
{ ... }
Binds to the innermost enclosing loop
The structured block is executed in sequential loop order
The loop must carry the ordered clause
Each thread must encounter only one ordered region per iteration

Flush Directive
#pragma omp flush (var1, var2)
Stand-alone, like barrier
Only directly affects the encountering thread
The list of variables ensures that any compiler re-ordering moves all the flushes together
Implicit flushes: barrier, atomic, critical, locks

Atomic Directive
#pragma omp atomic
i++;
Light-weight critical section
Only for some expressions:
  x = expr (no mutual exclusion on the evaluation of expr)
  x++, ++x, x--, --x
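For instance, a histogram update is a natural fit, since only the single memory update needs protection; bucket_of, data, and hist are illustrative names.

#pragma omp parallel for
for (int i = 0; i < n; i++) {
  int b = bucket_of(data[i]);   // hypothetical helper
  #pragma omp atomic
  hist[b]++;                    // only this update is made atomic
}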

Helper Functions: General
void omp_set_dynamic(int);
int omp_get_dynamic();
void omp_set_nested(int);
int omp_get_nested();
int omp_get_num_procs();
int omp_get_num_threads();
int omp_get_thread_num();
int omp_get_ancestor_thread_num();
double omp_get_wtime();
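omp_get_wtime returns wall-clock seconds, which is the usual way to time a region; a small sketch reusing a and f0 from the earlier slides as placeholders:

double t0 = omp_get_wtime();
#pragma omp parallel for
for (int i = 0; i < n; i++)
  a[i] = f0(i);
double t1 = omp_get_wtime();
printf("parallel loop took %f seconds\n", t1 - t0);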

Helper Functions: Mutex
void omp_init_lock(omp_lock_t *);
void omp_destroy_lock(omp_lock_t *);
void omp_set_lock(omp_lock_t *);
void omp_unset_lock(omp_lock_t *);
int omp_test_lock(omp_lock_t *);
Nested-lock versions also exist, e.g., omp_set_nest_lock(omp_nest_lock_t *);
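A minimal sketch of the lock lifecycle (the shared-state update is only indicated by comments):

omp_lock_t lock;
omp_init_lock(&lock);

#pragma omp parallel
{
  omp_set_lock(&lock);            // blocks until the lock is acquired
  // ... update shared state ...
  omp_unset_lock(&lock);

  if (omp_test_lock(&lock)) {     // non-blocking attempt
    // ... got it ...
    omp_unset_lock(&lock);
  }
}

omp_destroy_lock(&lock);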

Nesting Restrictions
A critical region may never be nested inside a critical region with the same name
  Not sufficient to prevent deadlock
Not allowed without an intervening parallel region:
  A work-sharing region or barrier inside a work-sharing, critical, ordered, or master region
  A master region inside a work-sharing region
  An ordered region inside a critical region

EXAMPLES

Firstprivate and Lastprivate
#pragma omp parallel for firstprivate(simple)
for (int i = 0; i < n; i++) {
  simple += a[f1(i, omp_get_thread_num())];
  f2(simple);
}

#pragma omp parallel for lastprivate(doneearly)
for (i = 0; i < n; i++) {
  doneearly = f0(i);
}

Ordered Construct
int i;
#pragma omp for ordered
for (i = 0; i < n; i++) {
  if (isgroupa(i)) {
    #pragma omp ordered
    doit(i);
  } else {
    #pragma omp ordered
    doit(partner(i));
  }
}

Wrong Use of Multiple Ordered Regions
int i;
#pragma omp for ordered
for (i = 0; i < n; i++) {
  #pragma omp ordered
  doit(i);
  #pragma omp ordered
  doit(partner(i));
}

OpenMP Matrix Multiply
#pragma omp parallel for
for (int i = 0; i < n; i++)
  for (int j = 0; j < n; j++) {
    c[i][j] = 0.0;
    for (int k = 0; k < n; k++)
      c[i][j] += a[i][k] * b[k][j];
  }
a, b, c are shared
i, j, k are private

OpenMP Matrix Multiply
#pragma omp parallel for
for (int i = 0; i < n; i++)
  #pragma omp parallel for
  for (int j = 0; j < n; j++) {
    c[i][j] = 0.0;
    for (int k = 0; k < n; k++)
      c[i][j] += a[i][k] * b[k][j];
  }

OpenMP Matrix Multiply
#pragma omp parallel for
for (int i = 0; i < n; i++)
  #pragma omp parallel for
  for (int j = 0; j < n; j++) {
    int sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int k = 0; k < n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }

OpenMP Matrix Multiply: Triangular
#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < n; i++)
  for (int j = i; j < n; j++) {
    c[i][j] = 0.0;
    for (int k = i; k < n; k++)
      c[i][j] += a[i][k] * b[k][j];
  }
This multiplies an upper-triangular matrix A with B
The workload is unbalanced across rows
The dynamic schedule improves this

OpenMP Jacobi
for some number of timesteps/iterations {
  #pragma omp parallel for
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1]);
  #pragma omp parallel for
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      grid[i][j] = temp[i][j];
}
This could be improved by using just one parallel region
The implicit barrier after each loop eliminates the race on grid
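A hedged sketch of the single-parallel-region version: the team is forked once, every thread runs the timestep loop, and the two omp for work-sharing loops inside still end in implicit barriers that keep the phases ordered. nsteps is illustrative, and only the interior points are updated here to keep the stencil in bounds.

#pragma omp parallel
for (int step = 0; step < nsteps; step++) {
  #pragma omp for
  for (int i = 1; i < n-1; i++)
    for (int j = 1; j < n-1; j++)
      temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                           grid[i][j-1] + grid[i][j+1]);
  #pragma omp for
  for (int i = 1; i < n-1; i++)
    for (int j = 1; j < n-1; j++)
      grid[i][j] = temp[i][j];
}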

Find the Error
#pragma omp parallel shared(a, b, nthreads, locka, lockb) private(tid)
{
  #pragma omp sections nowait
  {
    #pragma omp section
    {
      omp_set_lock(&locka);
      for (i = 0; i < n; i++) a[i] = i * DELTA;
      omp_set_lock(&lockb);
      for (i = 0; i < n; i++) b[i] += a[i];
      omp_unset_lock(&lockb);
      omp_unset_lock(&locka);
    }
    #pragma omp section
    {
      omp_set_lock(&lockb);
      for (i = 0; i < n; i++) b[i] = i * PI;
      omp_set_lock(&locka);
      for (i = 0; i < n; i++) a[i] += b[i];
      omp_unset_lock(&locka);
      omp_unset_lock(&lockb);
    }
  } /* end of sections */
}
Assume: the variables are declared and the locks are initialized
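The error is a classic lock-ordering deadlock: one section takes locka then lockb while the other takes lockb then locka, so each can end up holding the lock the other is waiting for. A hedged sketch of one fix is to make the second section acquire the locks in the same order as the first:

#pragma omp section
{
  omp_set_lock(&locka);                  /* same order as the first section */
  omp_set_lock(&lockb);
  for (i = 0; i < n; i++) b[i] = i * PI;
  for (i = 0; i < n; i++) a[i] += b[i];
  omp_unset_lock(&lockb);
  omp_unset_lock(&locka);
}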

Find the Error
void worksum(float *x, float *y, int *index, int n) {
  int i;
  #pragma omp parallel for shared(x, y, index, n)
  for (i = 0; i < n; i++) {
    #pragma omp atomic
    x[index[i]] += work1(i);
    y[i] += work2(i);
  }
  int work0 = x[0];
}
nowait

Efficiency Issues
Minimize synchronization
  Avoid barrier, critical, ordered, and locks
  Use nowait
  Use named critical sections for fine-grained locking
  Use master (instead of single)
Parallelize at the highest level possible
  Such as outer for loops; keep parallel regions large
flush is expensive
lastprivate has synchronization overhead
Thread-safe malloc/free are expensive
Reduce false sharing (see the sketch below)
  Design of data structures
  Use private
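To make the false-sharing point concrete: per-thread counters that sit in adjacent array slots share a cache line, so every update bounces that line between cores; accumulating in a private local avoids it. A hedged sketch; MAX_THREADS, test, and total are illustrative names.

/* Prone to false sharing: counts[t] of different threads share a cache line. */
int counts[MAX_THREADS] = {0};
#pragma omp parallel
{
  int t = omp_get_thread_num();
  #pragma omp for
  for (int i = 0; i < n; i++)
    if (test(i)) counts[t]++;            /* every hit bounces the line */
}

/* Better: accumulate in a private local, write out once per thread. */
#pragma omp parallel
{
  int local = 0;
  #pragma omp for
  for (int i = 0; i < n; i++)
    if (test(i)) local++;
  #pragma omp atomic
  total += local;
}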

Common SMP Errors
Synchronization
  Race condition: depends on timing
  Deadlock: waiting for a non-existent condition
  Livelock: continuously adjusting, but task progress is stalled
Try to:
  Avoid nested locks
  Release locks religiously
  Avoid "while true" (especially during testing)
Be careful with:
  Non-thread-safe libraries
  Concurrent access to shared data
  I/O inside parallel regions
  Differing views of shared memory (flush)
  nowait