COMP4510 Introduction to Parallel Computation: Shared Memory and OpenMP


Thanks to Jon Aronsson (UofM HPC consultant) for some of the material in these notes.

Outline (cont'd)
Shared Memory and OpenMP, including distributed shared memory (N.B. reordered)
Message-based parallel computing with MPI
Hybrid parallel programming with MPI and OpenMP
Introduction to multithreading: pre-fetching, simultaneous multithreading (SMT), chip multiprocessing (CMP)

Shared Memory Machines
Recall the UMA shared memory architecture: processors P0 through P7, each with its own cache, all connected to a single shared memory.
This is the traditional parallel computer: one big box, quite expensive, a relatively familiar environment, big memory, but limited scalability.

Shared Memory Machines (cont'd)
In a shared memory machine, all processors share a single memory. More precisely, all processors address the same memory. We may have local or shared variables, but everything is addressable.
Because memory is shared, data that is written by one process/thread can be read by another.
This is a familiar programming model, but we must provide synchronization, e.g. mutexes, barriers, etc.

Shared Memory Machines (cont'd)
This means that we will tend to write parallel programs that share data structures, which is convenient. You have probably already written a parallel program using pthreads (or fork under Unix/Linux) in your OS class, e.g. the producer-consumer problem. In parallel programming, the scale of parallelism will be larger: many threads, not just a few.

Shared Memory Programming
Writing parallel programs for shared memory systems can be done in many ways, e.g. pthreads, process forking, etc. Parallel programming is hard, so we look for tools and techniques that simplify things. OpenMP is a parallel programming system for shared memory machines that is much simpler to use than pthreads, for example.

What is OpenMP?
OpenMP is an Application Programming Interface (API) for shared memory parallel programming. The OpenMP API offers the programmer full control of parallelization. OpenMP is an industry standard; hence OpenMP code is portable across different hardware architectures and operating systems. OpenMP is available for Fortran and C/C++. OpenMP is ONLY compatible with shared memory machines (SMPs).

What's OpenMP? (cont'd)
A common way to use OpenMP is to take non-parallel code and annotate it with special directives to enable parallel execution. Of course, new parallel code can also be developed from scratch, but parallelizing a serial program might be easier. Again, the 80:20 rule applies: get most of the parallelism in a parallelization effort at much lower cost than a complete redesign.

What's OpenMP? (cont'd)
The OpenMP API consists of 3 primary components: compiler directives, library routines, and environment variables.

Compiler Directives: In OpenMP, the parallelism is defined by directives that are embedded in the source code. These directives are ignored by the compiler unless you tell the compiler to parallelize the code with OpenMP.

Library Routines: OpenMP includes a set of library routines for dynamically querying and altering the parallelism at runtime. For example, you can dynamically change the number of processors being used. A minimal sketch follows.
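For instance, a minimal sketch (not from the course notes) of querying and changing the thread count with the standard library routines omp_set_num_threads and omp_get_max_threads:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("Default max threads: %d\n", omp_get_max_threads());

    omp_set_num_threads(4);   /* request 4 threads for subsequent parallel regions */

    printf("Max threads now:    %d\n", omp_get_max_threads());
    return 0;
}

Calling omp_set_num_threads overrides the OMP_NUM_THREADS default for the parallel regions that follow it.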

Environment Variables: The environment variables are set before the program is executed and control the default parallelism; the defaults may be dynamically altered by library calls.

Parallel Regions
OpenMP uses parallel regions to specify the blocks of code to be executed in parallel. The code outside the parallel regions runs serially. When the program reaches a parallel region, it creates a number of threads and each thread executes the same block of code in the region separately, operating on different data (SPMD parallelism).

Parallel Regions (cont'd)
OpenMP uses the fork-join model of parallel execution: a master thread runs serially, FORKs a team of parallel threads at each parallel region, and JOINs them back into the single master thread at the end of the region.

Parallel Regions (cont'd)
So, how do you code a parallel region? You use an OpenMP directive to indicate that a block of code is a parallel region.

In (free form) Fortran:

!$OMP PARALLEL
  [block of code]
!$OMP END PARALLEL

In C/C++:

#pragma omp parallel
{
  [block of code]
}

Note: In Fortran, all OpenMP directives begin with !$OMP. In C/C++, all OpenMP directives begin with #pragma omp.

Parallel Regions (cont'd)
Simple example:

#include <stdio.h>

int main(void)
{
    printf("Before Parallel Region\n");
    #pragma omp parallel
    {
        printf("Inside Parallel Region\n");
    }
    printf("After Parallel Region\n");
    return 0;
}

Parallel Regions (cont'd)
With 3 parallel threads, the master thread prints "Before Parallel Region", FORKs three threads that each print "Inside Parallel Region" in the parallel region, JOINs, and then the master prints "After Parallel Region".

Parallel Regions (cont'd)
OpenMP compilers typically generate pthreads code, which makes it easy to add OpenMP support to existing compilers. E.g. newer versions of gcc support OpenMP, enabled by a compiler switch: gcc -fopenmp. Advanced, parallelizing compilers may also support OpenMP, e.g. the Portland Group compilers: pgcc -mp.

Parallel Regions (cont'd)
Running an OpenMP program: an OpenMP program is executed just like any other program (since it is shared memory, one box). The number of threads to be used can be set using the OMP_NUM_THREADS environment variable (recommended!) or by using a library routine.

In BASH:  export OMP_NUM_THREADS=value
In CSH:   setenv OMP_NUM_THREADS value

Parallel Regions (cont'd)
Compiling and running a program:

$ gcc -fopenmp simple_example.c      # compile
$ export OMP_NUM_THREADS=2           # set number of threads to 2
$ ./a.out                            # execute
Before Parallel Region
Inside Parallel Region
Inside Parallel Region
After Parallel Region
$ export OMP_NUM_THREADS=3           # set number of threads to 3
$ ./a.out                            # execute again
Before Parallel Region
Inside Parallel Region
Inside Parallel Region
Inside Parallel Region
After Parallel Region

Parallel Regions (cont'd)
There are two OpenMP library routines that are particularly useful:
omp_get_thread_num returns the rank of the current thread inside a parallel region. The rank ranges from 0 to N-1, where N is the number of threads; each thread has a unique rank.
omp_get_num_threads returns the total number of parallel threads, which is not necessarily equal to the number of processors.
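As a small hedged illustration (not one of the examples in these notes), the two routines are often combined to print each thread's rank and the team size:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        printf("Thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}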

Parallel Regions (cont'd)
Another example:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("Before Parallel Region\n");
    #pragma omp parallel
    {
        printf("Inside Rank no. %d\n", omp_get_thread_num());
    }
    printf("After Parallel Region\n");
    return 0;
}

Parallel Regions (cont'd)
With 3 parallel threads, the master prints "Before Parallel Region", the FORKed threads print their ranks (0, 1, 2) inside the parallel region, and after the JOIN the master prints "After Parallel Region".

Parallel Regions (cont'd)
Another example, compiled and run:

$ gcc -fopenmp example.c        # compile
$ export OMP_NUM_THREADS=3      # set number of threads to 3
$ ./a.out                       # execute
Before Parallel Region
Inside Rank no. 2
Inside Rank no. 0
Inside Rank no. 1
After Parallel Region
$

Note the order of the output! (Parallel execution is non-deterministic.)

Parallel Regions (cont'd)
The calls omp_get_thread_num and omp_get_num_threads provide the information necessary to write an SPMD (Single Program, Multiple Data) parallel program using the parallel region construct. The idea is to subdivide the data to be processed into omp_get_num_threads pieces and let each thread work on its own part; hence, SPMD.

Parallel Regions (cont'd)
For example, if we want to search for a value, V, in a vector, VEC, we can have each thread search part of the vector: the first thread searches the first piece, the second thread the next piece, and so on up to the last thread.

Parallel Regions (cont'd)
Assuming the size of VEC is k and that k is evenly divisible by the number of threads, we can write an OpenMP code segment to do the searching. First we compute the size of each piece of the vector (to be searched by each thread). Then we call a function to search our part of the vector, passing the start point and size of our piece.

Parallel Regions (cont d) #pragma omp parallel { int size=k/omp_get_num_threads(); SrchSubVec (omp_get_thread_num()*size, size); 10/2/07 COMP4510 - Introduction to Parallel Computation 27 Variables Consider the following modification to our previous program: int main(int argc, char *argv[]) { int rank; printf( Before Parallel Region\n ); #pragma omp parallel { rank = omp_get_thread_num() printf( Inside Rank\n,rank); printf( After Parallel Region\n ); What happens if two threads attempt to assign a value to rank at the same time? 10/2/07 COMP4510 - Introduction to Parallel Computation 28 14

Variables (cont'd)
This leads to shared and private variables. All threads share the same address space, so all threads can modify the same variables, which might result in undesirable behaviour. Variables in a parallel region are shared by default! A private variable is only accessed by a single thread. Variables are declared private with the private(list-of-variables) clause. The default(private) clause can be used to change the default so that variables are private (in Fortran; C/C++ has traditionally supported only default(shared) and default(none)); in that case, shared variables must be declared using the shared(list-of-variables) clause.

Variables (cont'd)
The proper way of writing the previous example is:

int main(int argc, char *argv[])
{
    int rank;
    printf("Before Parallel Region\n");
    #pragma omp parallel private(rank)
    {
        rank = omp_get_thread_num();
        printf("Inside Rank %d\n", rank);
    }
    printf("After Parallel Region\n");
    return 0;
}

Note: the private clause is added to the omp parallel directive.
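As an optional illustration of explicit data-sharing clauses (an addition to these notes, not the original example), default(none) can be combined with explicit private() and shared() lists so that the compiler forces every variable's attribute to be stated:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int rank;           /* per-thread scratch value: must be private   */
    int nthreads = 0;   /* written by one thread, read after the join  */

    #pragma omp parallel default(none) private(rank) shared(nthreads)
    {
        rank = omp_get_thread_num();
        if (rank == 0)
            nthreads = omp_get_num_threads();
        printf("Inside Rank %d\n", rank);
    }
    printf("Ran with %d threads\n", nthreads);
    return 0;
}

default(none) is a common defensive style: a forgotten private declaration becomes a compile-time error rather than a race.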

Loops
OpenMP also provides a mechanism for parallelizing loop-based computations. There is a parallel for directive that can be used in C/C++ to indicate that a loop can be executed in parallel. The loop index is automatically declared as a private variable. Consider the following simple sequential loop:

for (i = 0; i < 100; i++)
    array[i] = i * i;

Loops (cont'd)
Because there are no dependences within the loop, it may be freely parallelized in OpenMP. This is accomplished using the #pragma omp parallel for construct. Note that the determination of freedom from dependences is up to the programmer, not the compiler. The revised example code would be:

#pragma omp parallel for
for (i = 0; i < 100; i++)
    array[i] = i * i;
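Assembling the snippet into a complete program (a minimal sketch; the array size and the final printf are added here only for illustration):

#include <stdio.h>

#define N 100

int main(void)
{
    int i, array[N];

    /* the loop index i is automatically private to each thread */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        array[i] = i * i;

    printf("array[%d] = %d\n", N - 1, array[N - 1]);   /* 9801, regardless of thread count */
    return 0;
}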

Loops (cont'd)
Using 4 threads, OpenMP will parallelize the loop in the previous example by FORKing four threads that each execute one quarter of the iterations, roughly for(i=0;i<25...), for(i=25;i<50...), for(i=50;i<75...), and for(i=75;i<100...), and then JOINing them.

Loops (cont'd)
What if we had some dependences, though?

sum = 0;
#pragma omp parallel for
for (i = 0; i < 100; i++)
    sum = sum + vec[i];    /* all threads update the shared variable sum */

One solution to the problem would be to let only one thread update sum at any time. OpenMP provides a synchronization directive, critical, to define a critical region that only one thread can execute at a time:

#pragma omp critical
{
    code
}
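One hedged way to apply the critical directive to the sum loop is sketched below (dummy data is added so the example is self-contained); it is correct, but as the next slides show, the synchronization cost is severe:

#include <stdio.h>

int main(void)
{
    double vec[100], sum = 0.0;
    int i;

    for (i = 0; i < 100; i++)
        vec[i] = 1.0;                 /* dummy data */

    #pragma omp parallel for
    for (i = 0; i < 100; i++) {
        #pragma omp critical
        sum = sum + vec[i];           /* only one thread at a time updates sum */
    }

    printf("sum = %f\n", sum);        /* always 100.0 */
    return 0;
}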

Loops (cont'd)
If more than one thread arrives at a critical region, they will wait and execute the section sequentially. But synchronization is expensive, so we want to minimize the number of synchronization constructs while maintaining correctness. Also, we want to do as much parallel computation as possible between synchronizations, i.e. between critical regions, the beginning/end of parallel regions, and the start of parallel loops. E.g. SGI recommends at least 1,000,000 floating point operations per synchronization for their SMP machines (typically 100s of processors).

Loops (cont'd)
What overhead is involved in using a critical region around sum=sum+...? After the FORK, each thread repeatedly does a little work (one loop iteration) and then must wait its turn to execute sum=sum+... inside the critical region.

Loops (cont'd)
Synchronization overhead: most of each thread's time is spent waiting to enter the critical region rather than computing, so the parallel loop is effectively serialized and gains little over serial execution.

Loops (cont'd)
In our examples of parallel loops so far, the data accessed in each loop iteration has been distinct from that accessed in other iterations. This makes life very easy! In common parallel programming, this is not always the case. Consider:

for (i = 1; i < 100000; i++) {
    a[i] = a[i-1] + b[i];
}

Loops (cont'd)
Tracing the first few iterations shows that each iteration reads the value the previous iteration wrote:

Iter 1: a[1] = a[0] + b[1]
Iter 2: a[2] = a[1] + b[2]
Iter 3: a[3] = a[2] + b[3]
Iter 4: a[4] = a[3] + b[4]

Loops (cont'd)
In cases where there are such dependences between loop iterations, we cannot use OpenMP parallel loop structures. Such situations are either entirely non-parallelizable or require code restructuring, e.g. manual division of the loop into parallel blocks of 8 iterations, assuming a dependence distance of 8.
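The loop above has a dependence distance of 1 and cannot be restructured this way, but as a hedged sketch of the idea, suppose a hypothetical loop in which a[i] depends only on a[i-8] (dependence distance 8); the 8 iterations inside each block are then mutually independent and can run in parallel, while the blocks themselves remain sequential:

#include <stdio.h>

#define N 100000

int main(void)
{
    static double a[N], b[N];
    int i, j, end;

    for (i = 0; i < N; i++) {
        a[i] = (double)i;
        b[i] = 1.0;
    }

    /* a[i] depends only on a[i-8], so within each block of 8 consecutive
       iterations there are no dependences; blocks execute one after another */
    for (j = 8; j < N; j += 8) {
        end = (j + 8 < N) ? j + 8 : N;    /* upper bound of this block */
        #pragma omp parallel for
        for (i = j; i < end; i++)
            a[i] = a[i - 8] + b[i];
    }

    printf("a[%d] = %f\n", N - 1, a[N - 1]);
    return 0;
}

In practice, 8 iterations per parallel region is far too little work to pay for the fork/join overhead; the sketch only shows the restructuring pattern.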

Reduction Variables
The reduction(operator:variable) clause performs a reduction on the listed variable, i.e. some sort of aggregation of partial results. Each thread is assigned a private copy of the variable. At the end of the parallel region, the reduction operator is applied to each private copy and the result is copied to the shared variable.

C code for a vector sum using reduction:

#pragma omp parallel for reduction(+:sum)
for (i = 1; i <= 100; i++)
    sum = sum + vec[i];

Reduction Variables (cont'd)
For the loop above, OpenMP FORKs the threads, each thread computes a private partial sum over its chunk of iterations (e.g. i=1..25 into sum0, i=26..50 into sum1, and so on), and at the JOIN the partial sums are REDUCEd into the shared variable: sum = sum0 + sum1 + ...
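Putting the reduction snippet into a complete, runnable program (a minimal sketch; the vector contents and the 0-based indexing are illustrative choices):

#include <stdio.h>

#define N 100

int main(void)
{
    double vec[N], sum = 0.0;
    int i;

    for (i = 0; i < N; i++)
        vec[i] = 1.0;                          /* dummy data */

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
        sum = sum + vec[i];                    /* each thread adds into its private copy */

    printf("sum = %f\n", sum);                 /* 100.0 */
    return 0;
}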

Reduction Variables (cont'd)
Each thread computes its partial sum independently, so there is far less overhead than using critical regions!

Timing Runs
We've seen code execute in parallel, and when we included appropriate printf/write statements the parallel execution was obvious. But how do we know we are executing in parallel without the printf/writes? The code should run faster! But how do we know it's running faster? And, more importantly, how do we determine how much speedup we are getting from running in parallel?

Timing Runs (cont'd)
A useful Unix/Linux command for measuring the runtime of OpenMP programs is the time command:

time your_program [arguments]

The time command will run your program and collect some timing statistics: the elapsed real time, and the CPU time used by the program. The CPU time is split between user and system time, representing time spent executing your program code and time spent executing O/S functions (e.g. I/O calls), respectively.

Timing Runs (cont'd)
The output format of time varies between different versions of the command. Under the BASH shell:

$ export OMP_NUM_THREADS=1
$ time ./my_program
real    0m5.282s
user    0m4.980s
sys     0m0.010s
$ export OMP_NUM_THREADS=4
$ time ./my_program
real    0m1.456s
user    0m5.140s
sys     0m0.020s

Here real is the elapsed time and user is the CPU time; note how the real and user times change with 4 threads. Use the elapsed time to determine a program's parallel efficiency. In this example, the program runs 5.282/1.456 = 3.6 times faster on four processors compared to one processor.

Timing Runs (cont'd)
You must remember that the Unix/Linux time command times everything; sometimes this is good and sometimes not. E.g. what if you want to time only part of your program? Also, this command is not applicable when you are running parallel code on clusters, since there is more than one machine being used and more than one command being run. More on timing parallel programs later!
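One option for timing just part of a program (an addition to these notes, not covered on the slides) is OpenMP's own omp_get_wtime library routine, which returns wall-clock time in seconds; a minimal sketch:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double t0, t1, sum = 0.0;
    int i;

    t0 = omp_get_wtime();                      /* wall-clock time, in seconds */

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < 10000000; i++)
        sum += (double)i;

    t1 = omp_get_wtime();
    printf("sum = %f  elapsed = %f s\n", sum, t1 - t0);
    return 0;
}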