CS 61C: Great Ideas in Computer Architecture (Machine Structures) Thread Level Parallelism: OpenMP

CS 61C: Great Ideas in Computer Architecture (Machine Structures)
Thread Level Parallelism: OpenMP
Instructors: Randy H. Katz, David A. Patterson
http://inst.eecs.berkeley.edu/~cs61c/sp11
3/15/11 -- Spring 2011, Lecture #16

You Are Here! (levels-of-parallelism diagram)
Software: Parallel Requests, assigned to computer (e.g., search "Katz"); Parallel Threads, assigned to core (e.g., lookup, ads); Parallel Instructions, >1 instruction at one time (e.g., 5 pipelined instructions); Parallel Data, >1 data item at one time (e.g., add of 4 pairs of words); Hardware descriptions, all gates functioning in parallel at the same time.
Hardware: Warehouse Scale Computer; Computer (Core, Memory, Input/Output, Cache); Core (Instruction Unit(s), Functional Unit(s): A0+B0, A1+B1, A2+B2, A3+B3); Main Memory; Smart Phone; Logic Gates.
Harness parallelism and achieve high performance. Project 3.

Review: Synchronization in MIPS
Load linked: ll rt, offset(rs)
Store conditional: sc rt, offset(rs)
  Succeeds if the location has not changed since the ll: returns 1 in rt (clobbers the register value being stored).
  Fails if the location has changed: returns 0 in rt (clobbers the register value being stored).
Example: atomic swap (to test/set a lock variable). Exchange the contents of a register and memory: $s4 and ($s1).

try: add $t0,$zero,$s4  ;copy exchange value
     ll  $t1,0($s1)     ;load linked
     sc  $t0,0($s1)     ;store conditional
     beq $t0,$zero,try  ;branch if store fails
     add $s4,$zero,$t1  ;put load value in $s4

Agenda
OpenMP Introduction
Administrivia
OpenMP Examples
Technology Break
Common Pitfalls
Summary

Machines in the 61C Lab
/usr/sbin/sysctl -a | grep hw\.
hw.model = MacPro4,1
hw.cachelinesize = 64
hw.l1icachesize: 32,768
hw.l1dcachesize: 32,768
hw.l2cachesize: 262,144
hw.l3cachesize: 8,388,608
hw.physicalcpu: 8
hw.logicalcpu: 16
hw.cpufrequency = 2,260,000,000
hw.physmem = 2,147,483,648
Try up to 16 threads to see if there is a performance gain, even though there are only 8 cores.

Randy's Laptop
hw.model = MacBookAir3,1
hw.physicalcpu: 2
hw.logicalcpu: 2
hw.cpufrequency: 1,600,000,000
hw.physmem = 2,147,483,648
hw.cachelinesize = 64
hw.l1icachesize =
hw.l1dcachesize =
hw.l2cachesize = 3,145,728
No L3 cache. Dual core, one hardware context per core.

What is OpenMP?
An API used for multi-threaded, shared-memory parallelism:
  Compiler directives
  Runtime library routines
  Environment variables
OpenMP specification: portable, standardized. See http://computing.llnl.gov/tutorials/openmp/

Shared Memory Model with Explicit Thread-based Parallelism
A shared-memory process consists of multiple threads; this is an explicit programming model with full programmer control over parallelization.
Pros:
  Takes advantage of shared memory; the programmer need not worry (that much) about data placement.
  The programming model is "serial-like" and thus conceptually simpler than alternatives (e.g., message passing/MPI).
  Compiler directives are generally simple and easy to use.
  Legacy serial code does not need to be rewritten.
Cons:
  Code can only be run in shared-memory environments!
  The compiler must support OpenMP (e.g., gcc 4.2).

OpenMP Uses the C Extension Pragmas Mechanism
Pragmas are a preprocessor mechanism that C provides for language extensions. They have many different uses: structure packing, symbol aliasing, floating point exception modes, ...
Pragmas are a good fit for OpenMP because compilers that don't recognize a pragma are supposed to ignore it, so the same program still runs on a sequential computer even with the pragmas embedded.
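To make that last point concrete, here is a minimal sketch (not from the slides): the same source runs serially under a compiler that ignores the pragma, and in parallel when OpenMP support is enabled (e.g., gcc -fopenmp). The array name and size are illustrative.

#include <stdio.h>

#define LEN 1000

int main(void) {
    int zero[LEN];
    /* An OpenMP-unaware compiler simply ignores this pragma and runs the
       loop serially; with OpenMP enabled, the iterations are split across threads. */
    #pragma omp parallel for
    for (int i = 0; i < LEN; i++)
        zero[i] = 0;
    printf("zero[0] = %d, zero[%d] = %d\n", zero[0], LEN - 1, zero[LEN - 1]);
    return 0;
}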

OpenMP Programming Model
Fork-Join model: OpenMP programs begin as a single process, the master thread, which executes sequentially until the first parallel region construct is encountered.
FORK: the master thread then creates a team of parallel threads. Statements in the program that are enclosed by the parallel region construct are executed in parallel among the various team threads.
JOIN: when the team threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread.

OpenMP Directives (work-sharing constructs)
for: shares iterations of a loop across the team.
sections: each section is executed by a separate thread.
single: serializes the execution of a thread.

Building Block: the C for loop
for (i=0; i<max; i++) zero[i] = 0;
Break the for loop into chunks, and allocate each chunk to a separate thread. E.g., if max = 100 with two threads, assign iterations 0-49 to thread 0 and 50-99 to thread 1.
The loop must have a relatively simple "shape" for an OpenMP-aware compiler to be able to parallelize it: the run-time system must be able to determine how many of the loop iterations to assign to each thread, and no premature exits from the loop are allowed (i.e., no break, return, exit, or goto statements).

OpenMP: Parallel for pragma
#pragma omp parallel for
for (i=0; i<max; i++) zero[i] = 0;
The master thread creates additional threads, each with a separate execution context. All variables declared outside the for loop are shared by default, except the loop index, which is private per thread (Why?). There is an implicit synchronization at the end of the for loop. Index regions are divided sequentially per thread: thread 0 gets 0, 1, ..., (max/n)-1; thread 1 gets max/n, max/n+1, ..., 2*(max/n)-1. Why? (Student Roulette?)

Thread Creation
How many threads will OpenMP create? The number is defined by the OMP_NUM_THREADS environment variable (or by a procedure call in the code). Set this variable to the maximum number of threads you want OpenMP to use; it usually equals the number of cores in the underlying hardware on which the program is run.

OMP_NUM_THREADS
Shell command to set the number of threads: export OMP_NUM_THREADS=x
Shell command to check the number of threads: echo $OMP_NUM_THREADS
OpenMP intrinsic to set the number of threads: omp_set_num_threads(x);
OpenMP intrinsic to get the number of threads: num_th = omp_get_num_threads();
OpenMP intrinsic to get the thread ID number: th_id = omp_get_thread_num();

Parallel Threads and Scope
Each thread executes a copy of the code within the structured block:
#pragma omp parallel
{
  ID = omp_get_thread_num();
  foo(ID);
}
The OpenMP default is shared variables. To make a variable private, declare it with the pragma:
#pragma omp parallel private(x)
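A small sketch pulling these pieces together (thread count, thread IDs, and the private clause); the variable name x and the choice of two threads are illustrative assumptions, not code from the slides.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int x = -1;                          /* shared by default */
    omp_set_num_threads(2);              /* same effect as: export OMP_NUM_THREADS=2 */

    #pragma omp parallel private(x)      /* each thread gets its own x, initially undefined */
    {
        x = omp_get_thread_num();        /* no data race: this x is per-thread */
        printf("thread %d of %d sees x = %d\n",
               omp_get_thread_num(), omp_get_num_threads(), x);
    }
    printf("after the region, the shared x is still %d\n", x);   /* prints -1 */
    return 0;
}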

Agenda
OpenMP Introduction
Administrivia
OpenMP Examples
Technology Break
Common Pitfalls
Summary

Administrivia: Regrade Policy
Rubric on-line.
Questions covered in Discussion Section this week.
Written appeal process: explain the rationale for the regrade request, attach the rationale to the exam, and submit it to your TA in this week's laboratory.

Administrivia: Next Lab and Project
Lab #9: Thread Level Parallelism, this week.
HW #5: due March 27, Make and Threads.
Project #3: Matrix Multiply Performance Improvement. Work in groups of two! Stick around after class for shotgun marriage. Part 1: March 27 (end of Spring Break); Part 2: April.

http://en.wikipedia.org/wiki/Daimler_Dingo

Agenda
OpenMP Introduction
Administrivia
OpenMP Examples
Technology Break
Common Pitfalls
Summary

OpenMP Examples
Hello World
OMP Get Environment (hidden slides)
For Loop Workshare
Section Workshare
Sequential and Parallel Numerical Calculation of Pi

Hello World
#include <stdio.h>
#include <omp.h>
int main (void)
{
  int nthreads, tid;
  /* Fork team of threads with each having a private tid variable */
  #pragma omp parallel private(tid)
  {
    /* Obtain and print thread id */
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);
    /* Only master thread does this */
    if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  } /* All threads join master thread and terminate */
}

Hello World output:
localhost:openmp randykatz$ ./omp_hello
Hello World from thread = 0
Hello World from thread = 1

Get Environment Information
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main (int argc, char *argv[])
{
  int nthreads, tid, procs, maxt, inpar, dynamic, nested;
  #pragma omp parallel private(nthreads, tid)
  {
    tid = omp_get_thread_num();   /* Obtain thread number */
    if (tid == 0) {               /* Only master thread does this */
      printf("Thread %d getting environment info...\n", tid);
      /* Get environment information */
      procs = omp_get_num_procs();
      nthreads = omp_get_num_threads();
      maxt = omp_get_max_threads();
      inpar = omp_in_parallel();
      dynamic = omp_get_dynamic();
      nested = omp_get_nested();
      /* Print environment information */
      printf("Number of processors = %d\n", procs);
      printf("Number of threads = %d\n", nthreads);
      printf("Max threads = %d\n", maxt);
      printf("In parallel? = %d\n", inpar);
      printf("Dynamic threads enabled? = %d\n", dynamic);
      printf("Nested parallelism supported? = %d\n", nested);
    }
  } /* Parallel Done */
}

Get Environment Information output:
Thread 0 getting environment info...
Number of processors = 2
Max threads = 1
In parallel? = 1
Dynamic threads enabled? = 0
Nested parallelism supported? = 0
Executes two threads; prints information from thread 0 only. At this point just one thread is available, hence max threads = 1.
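One detail the environment output can obscure: omp_get_num_threads() reports the size of the current team, which is 1 outside any parallel region and only becomes the full team size inside one. A minimal sketch of that behavior (not from the slides):

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Outside any parallel region only the initial thread exists,
       so the "current team" has size 1. */
    printf("outside a parallel region: %d thread(s)\n", omp_get_num_threads());

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)   /* report once, from the master thread */
            printf("inside a parallel region: %d thread(s)\n", omp_get_num_threads());
    }
    return 0;
}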

Workshare for loop Example #1
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define CHUNKSIZE 10
#define N 100
int main (int argc, char *argv[])
{
  int nthreads, tid, i, chunk;
  float a[N], b[N], c[N];
  /* Some initializations */
  for (i=0; i < N; i++)
    a[i] = b[i] = i * 1.0;
  chunk = CHUNKSIZE;

  #pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
  {
    tid = omp_get_thread_num();
    if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
    printf("Thread %d starting...\n", tid);
    #pragma omp for schedule(dynamic,chunk)
    for (i=0; i<N; i++) {
      c[i] = a[i] + b[i];
      printf("Thread %d: c[%d]= %f\n", tid, i, c[i]);
    }
  } /* end of parallel section */
}

The iterations of the loop are scheduled dynamically across the team of threads. A thread performs CHUNK iterations at a time before being scheduled for the next CHUNK of work.

Output (excerpt; the printed c[i] values are omitted): Thread 0 and Thread 1 each print "Thread k starting..." and then interleave lines of the form "Thread k: c[i]= ...", each thread working through chunks of 10 consecutive indices.

After export OMP_IN_PARALLEL=2, the output is similar: the two threads interleave, each handling dynamically assigned chunks of 10 iterations.
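The schedule(dynamic, chunk) clause above hands out chunks on demand; a static schedule instead carves the index range into fixed blocks up front. A small sketch contrasting the two (the array size and chunk size are illustrative choices, not from the slides):

#include <stdio.h>

#define N 100

int main(void) {
    float a[N], b[N], c[N], d[N];
    for (int i = 0; i < N; i++)
        a[i] = b[i] = i * 1.0f;

    /* static: the iteration space is split into fixed contiguous blocks, one per thread */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* dynamic: each thread grabs the next chunk of 10 iterations whenever it finishes one */
    #pragma omp parallel for schedule(dynamic, 10)
    for (int i = 0; i < N; i++)
        d[i] = a[i] * b[i];

    printf("c[%d] = %f, d[%d] = %f\n", N - 1, c[N - 1], N - 1, d[N - 1]);
    return 0;
}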

Workshare sections Example #2
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define N 50
int main (int argc, char *argv[])
{
  int i, nthreads, tid;
  float a[N], b[N], c[N], d[N];
  /* Some initializations */
  for (i=0; i<N; i++) {
    a[i] = i * 1.5;
    b[i] = i;
    c[i] = d[i] = 0.0;
  }

  #pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
  {
    tid = omp_get_thread_num();
    if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
    printf("Thread %d starting...\n", tid);
    #pragma omp sections nowait
    {
      #pragma omp section
      {
        printf("Thread %d doing section 1\n", tid);
        for (i=0; i<N; i++) {
          c[i] = a[i] + b[i];
          printf("Thread %d: c[%d]= %f\n", tid, i, c[i]);
        }
      }
      #pragma omp section
      {
        printf("Thread %d doing section 2\n", tid);
        for (i=0; i<N; i++) {
          d[i] = a[i] * b[i];
          printf("Thread %d: d[%d]= %f\n", tid, i, d[i]);
        }
      }
    } /* end of sections */
    printf("Thread %d done.\n", tid);
  } /* end of parallel section */
}

Output (excerpt; array values omitted): in one run, Thread 0 executes both section 1 and section 2 while Thread 1 starts and immediately reports "Thread 1 done."

After export OMP_IN_PARALLEL=2 the work is split: Thread 1 does section 1 and Thread 0 does section 2, with their output interleaved; the nowait in sections lets each thread report "done" as soon as its own work finishes.

Riemann Sum Approximation: Calculating π using Numerical Integration
P is a partition of [a, b] into n subintervals, [x_0, x_1], [x_1, x_2], ..., [x_{n-1}, x_n], each of width Δx. x_i* is a point in the interval [x_{i-1}, x_i] for each i. As ||P|| → 0, n → ∞ and Δx → 0. Choose x_i* to be the midpoint of each subinterval, i.e., x_i* = ½(x_{i-1} + x_i), for each i from 1 to n.

Sequential Calculation of π in C
/* Serial Code */
static long num_steps = ;
double step;
int main (int argc, char *argv[])
{
  int i;
  double x, pi, sum = 0.0;
  step = 1.0/(double) num_steps;
  for (i=0; i < num_steps; i++) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = sum / num_steps;
  printf ("pi = %6.12f\n", pi);
}
Output: pi = ...

OpenMP Version #1
static long num_steps = ;
double step;
#define NUM_THREADS 2
int main (int argc, char *argv[])
{
  int i, id;
  double x, pi, sum[NUM_THREADS];
  step = 1.0/(double) num_steps;
  #pragma omp parallel private(x)
  {
    id = omp_get_thread_num();
    for (i = id, sum[id] = 0.0; i < num_steps; i = i + NUM_THREADS) {
      x = (i + 0.5)*step;
      sum[id] += 4.0/(1.0+x*x);
    }
  }
  for (i = 0, pi = 0.0; i < NUM_THREADS; i++)
    pi += sum[i];
  printf ("pi = %6.12f\n", pi / num_steps);
}

OpenMP Version #1 (with bug)
Output of repeated runs: 18 lines of "pi = ..." (the printed values are omitted here).
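For reference, the quantity that the serial code and each of the OpenMP versions approximates is the midpoint-rule Riemann sum for a standard integral; the identity is implicit in the code rather than stated on the slides:

\[
\pi \;=\; \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; \sum_{i=0}^{n-1} \frac{4}{1+x_i^2}\,\Delta x,
\qquad x_i = (i+0.5)\,\Delta x, \quad \Delta x = \frac{1}{n},
\]

where n is num_steps and Δx is the step variable in the code.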

OpenMP Critical Section
int main(void)
{
  int x;
  x = 0;
  #pragma omp parallel shared(x)
  {
    #pragma omp critical
    x = x + 1;
  } /* end of parallel section */
}
Only one thread at a time executes the code block that follows the critical pragma; the compiler generates the necessary lock/unlock code around the increment of x. (Student Roulette?)

OpenMP Version #2
static long num_steps = ;
double step;
#define NUM_THREADS 2
int main (int argc, char *argv[])
{
  int i, id;
  double x, sum, pi = 0.0;
  step = 1.0/(double) num_steps;
  #pragma omp parallel private(x, sum)
  {
    id = omp_get_thread_num();
    for (i = id, sum = 0.0; i < num_steps; i = i + NUM_THREADS) {
      x = (i + 0.5)*step;
      sum += 4.0/(1.0+x*x);
    }
    #pragma omp critical
    pi += sum;
  }
  printf ("pi = %6.12f\n", pi/num_steps);
}

OpenMP Version #2 (with bug)
Output of repeated runs: 20 lines of "pi = ..." (the printed values are omitted here).

OpenMP Reduction
Reduction: specifies that one or more variables that are private to each thread are the subject of a reduction operation at the end of the parallel region: reduction(operation:var), where
  Operation: the operator to apply to the variables (var) at the end of the parallel region.
  Var: one or more variables on which to perform the scalar reduction.
#pragma omp for reduction(+:nsum)
for (i = START ; i <= END ; i++)
  nsum += i;
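The reduction fragment above leaves START and END undefined; a self-contained sketch of the same idea (the bounds 1 and 100 and the use of long are illustrative assumptions):

#include <stdio.h>

#define START 1
#define END   100

int main(void) {
    long nsum = 0;
    /* Each thread accumulates into a private copy of nsum; the private
       copies are combined with + into the shared nsum when the loop ends. */
    #pragma omp parallel for reduction(+:nsum)
    for (int i = START; i <= END; i++)
        nsum += i;
    printf("sum of %d..%d = %ld\n", START, END, nsum);   /* prints 5050 */
    return 0;
}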

OpenMP Version #3
static long num_steps = ;
double step;
int main (int argc, char *argv[])
{
  int i;
  double x, pi, sum = 0.0;
  step = 1.0/(double) num_steps;
  #pragma omp parallel for private(x) reduction(+:sum)
  for (i = 0; i < num_steps; i++) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = sum / num_steps;
  printf ("pi = %6.8f\n", pi);
}
Note: you don't have to declare the for-loop index variable i private, since that is the default.

OpenMP Timing
omp_get_wtime: elapsed wall clock time.
double omp_get_wtime(void);
Returns the elapsed wall clock time in seconds. Time is measured per thread; no guarantee can be made that two distinct threads measure the same time. Time is measured from some "time in the past": on POSIX-compliant systems, the seconds since the Epoch (00:00:00 UTC, January 1, 1970) are returned.
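A hedged sketch of how omp_get_wtime is typically wrapped around a parallel region (this driver is illustrative, not the lecture's code; the trip count is an arbitrary choice):

#include <stdio.h>
#include <omp.h>

int main(void) {
    long n = 100000000L;                 /* arbitrary amount of work */
    double sum = 0.0;
    double start = omp_get_wtime();      /* wall-clock time before the region */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += 1.0 / (double)n;
    double elapsed = omp_get_wtime() - start;
    printf("sum = %f computed in %f seconds\n", sum, elapsed);
    return 0;
}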

OpenMP Version #3
Output of repeated runs: 24 lines of the form "pi = ... in ... seconds" (the printed values are omitted here).

Agenda
OpenMP Introduction
Administrivia
OpenMP Examples
Technology Break
Common Pitfalls
Summary

OpenMP Version #4
static long num_steps = ;
double step;
int main (int argc, char *argv[])
{
  int i, id;
  double x, pi, sum;
  sum = 0.0;
  step = 1.0/(double) num_steps;
  #pragma omp parallel
  {
    #pragma omp for
    for (i = 0; i < num_steps; i++) {
      x = (i + 0.5)*step;
      sum += 4.0/(1.0+x*x);
    }
  }
  printf ("pi = %6.12f\n", step * sum);
}

OpenMP Version #4 (with bug)
Output of repeated runs: 20 lines of "pi = ..." (the printed values are omitted here).

OpenMP Version #5
static long num_steps = ;
double step;
int main (int argc, char *argv[])
{
  int i, id;
  double x, pi, sum, sum0;
  sum = 0.0;
  step = 1.0/(double) num_steps;
  #pragma omp parallel private(x, sum0) shared(step, sum)
  {
    sum0 = 0.0;
    #pragma omp for
    for (i = 0; i < num_steps; i++) {
      x = (i + 0.5)*step;
      sum0 += 4.0/(1.0+x*x);
    }
    sum = sum + sum0;
  }
  printf ("pi = %6.12f\n", step * sum);
}

OpenMP Version #6
static long num_steps = ;
double step;
int main (int argc, char *argv[])
{
  int i, id;
  double x, pi, sum, sum0;
  sum = 0.0;
  step = 1.0/(double) num_steps;
  #pragma omp parallel private(x, sum0) shared(step, sum)
  {
    sum0 = 0.0;
    #pragma omp for
    for (i = 0; i < num_steps; i++) {
      x = (i + 0.5)*step;
      sum0 += 4.0/(1.0+x*x);
    }
    #pragma omp critical
    sum = sum + sum0;
  }
  printf ("pi = %6.12f\n", step * sum);
}

Description of a 32-Core System
Intel Nehalem Xeon 7550.
Hardware multithreading: 2 threads/core, 8 cores/chip, 4 chips/board = 64 threads/system.
2.00 GHz; 256 KB L2 cache per core; 18 MB (!) shared L3 cache per chip.
How can there be more than 1 thread per core? (Student Roulette?)

Matrix Multiply
start_time = omp_get_wtime();
#pragma omp parallel for private(tmp, i, j, k)
for (i=0; i<Ndim; i++) {
  for (j=0; j<Mdim; j++) {
    tmp = 0.0;
    for (k=0; k<Pdim; k++) {
      /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
      tmp += *(A+(i*Ndim+k)) * *(B+(k*Pdim+j));
    }
    *(C+(i*Ndim+j)) = tmp;
  }
}
run_time = omp_get_wtime() - start_time;
Note: the outer loop is spread across N threads; the inner loops run entirely inside one thread.
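The slide shows only the timed loop nest; a minimal self-contained driver for it might look like the sketch below. The dimension DIM and the fill values are illustrative assumptions, and all three dimensions are kept equal so the slide's Ndim/Pdim index arithmetic stays valid.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define DIM 500   /* illustrative size: Ndim = Mdim = Pdim = DIM */

int main(void) {
    int Ndim = DIM, Mdim = DIM, Pdim = DIM;
    int i, j, k;
    double tmp;
    double *A = malloc(Ndim * Pdim * sizeof(double));
    double *B = malloc(Pdim * Mdim * sizeof(double));
    double *C = malloc(Ndim * Mdim * sizeof(double));
    for (i = 0; i < Ndim * Pdim; i++) A[i] = 1.0;   /* fill A with 1s */
    for (i = 0; i < Pdim * Mdim; i++) B[i] = 2.0;   /* fill B with 2s */

    double start_time = omp_get_wtime();
    #pragma omp parallel for private(tmp, i, j, k)  /* outer loop split across threads */
    for (i = 0; i < Ndim; i++) {
        for (j = 0; j < Mdim; j++) {
            tmp = 0.0;
            for (k = 0; k < Pdim; k++)
                tmp += *(A + (i * Ndim + k)) * *(B + (k * Pdim + j));
            *(C + (i * Ndim + j)) = tmp;
        }
    }
    double run_time = omp_get_wtime() - start_time;

    /* With these fill values every entry of C should be 2.0 * DIM. */
    printf("C[0] = %f (expect %f), time = %f s\n", C[0], 2.0 * DIM, run_time);
    free(A); free(B); free(C);
    return 0;
}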

Notes on the Matrix Multiply Example
More performance optimizations are available:
  Higher compiler optimization (-O2, -O3) to reduce the number of instructions executed.
  Cache blocking to improve memory performance.
  Using SIMD SSE3 instructions to raise the floating point computation rate.

OpenMP Pitfall #1: Data Dependencies
Consider the following code:
a[0] = 1;
for (i=1; i<5; i++)
  a[i] = i + a[i-1];
There are dependencies between loop iterations, and sections of a loop split between threads will not necessarily execute in order. Out-of-order loop execution will result in undefined behavior.

OpenMP Pitfall #2: Avoiding Dependencies by Using Private Variables
Consider the following loop:
#pragma omp parallel for
for (i=0; i<n; i++) {
  temp = 2.0*a[i];
  a[i] = temp;
  b[i] = c[i]/temp;
}
Threads share a common address space, so they will be modifying temp simultaneously. Solution:
#pragma omp parallel for private(temp)
for (i=0; i<n; i++) {
  temp = 2.0*a[i];
  a[i] = temp;
  b[i] = c[i]/temp;
}

OpenMP Pitfall #3: Updating Shared Variables Simultaneously
Now consider a global sum:
for (i=0; i<n; i++)
  sum = sum + a[i];
This can be done by surrounding the summation with a critical section, but for convenience OpenMP also provides the reduction clause:
#pragma omp parallel for reduction(+:sum)
for (i=0; i<n; i++)
  sum = sum + a[i];
The compiler can generate highly efficient code for a reduction.

OpenMP Pitfall #4: Parallel Overhead
Spawning and releasing threads results in significant overhead, so you want to make your parallel regions as large as possible: parallelize over the largest loop that you can (even though it will involve more work to declare all of the private variables and eliminate dependencies). Coarse granularity is your friend!

And in Conclusion, ...
Sequential software is slow software: SIMD and MIMD are the only path to higher performance.
Multiprocessor/multicore systems use shared memory: cache coherency implements shared memory even with multiple copies in multiple caches; false sharing is a concern, so watch the block size!
Data races lead to subtle parallel bugs. Synchronization is done via atomic operations: MIPS does it with Load Linked + Store Conditional.
OpenMP is a simple parallel extension to C: threads, parallel for, private, critical sections, ...
