Synchronization

Synchronization: the mechanisms by which a parallel program coordinates the execution of multiple threads.

- Implicit synchronization
- Explicit synchronization

The main use of explicit synchronization is to control access to shared objects:
- Mutual exclusion
- Event synchronization

Event Synchronization

These constructs are used to signal the occurrence of an event across multiple threads, e.g.:
- Barrier
- Master
- Ordered
Barrier Directive

Threads in a team wait until the entire team reaches the barrier:

    !$OMP PARALLEL
    !$OMP DO REDUCTION(+:S)
    DO I = 1, 100
       S = S + F(I)
    END DO
    !$OMP END DO NOWAIT
    ! Wait for all the threads to reach this point
    !$OMP BARRIER
    PRINT *, S
    !$OMP END PARALLEL

Master Directive

Only the master thread executes the enclosed block of code:

    !$OMP PARALLEL
    !$OMP DO
    DO I = 1, N
       ! complex calculations here
    END DO
    !$OMP MASTER
    PRINT *, ...   ! intermediate results
    !$OMP END MASTER
    ! continue with the next calculations
    !$OMP END PARALLEL
Ordered Directive

The portion of code within the loop iteration enclosed in an ordered section must be executed in the original, sequential order of the loop iterations:

    !$OMP PARALLEL DO ORDERED
    DO I = 1, N
       A(I) = ...   ! complex calculations here
       ! Wait until the previous iteration has
       ! finished its ordered section
       !$OMP ORDERED
       PRINT *, A(I)
       ! Signal the completion of the ordered
       ! section from this iteration
       !$OMP END ORDERED
    END DO

Parallel Overhead

- The master thread has to start the slaves.
- Iterations have to be divided among threads.
- Threads must synchronize at the end of the loop.

Each iteration of a loop involves a certain amount of work, e.g.:
- Integer and floating-point operations
- Loads and stores of memory locations
- Control-flow instructions such as subroutine calls and branches
Reducing Overhead: If Clause

Instead of writing:

    if (n .gt. 800) then
    !$omp parallel do
       do i = 1, n
          z(i) = a * x(i) + y
       end do
    else
       do i = 1, n
          z(i) = a * x(i) + y
       end do
    endif

use the IF clause:

    !$omp parallel do if (n .gt. 800)
    do i = 1, n
       z(i) = a * x(i) + y
    end do

This avoids the parallel overhead at low trip counts.

Reducing Overhead: Loop Interchange

    do j = 2, n
    !$omp parallel do
       do i = 1, n
          ! The i loop is parallelizable; the j loop carries a
          ! data dependence and is not parallelizable
          a(i, j) = a(i, j) + a(i, j - 1)
       end do
    end do

Reduce parallel overhead through loop interchange, entering the parallel region only once:

    !$omp parallel do
    do i = 1, n
       do j = 2, n
          a(i, j) = a(i, j) + a(i, j - 1)
       end do
    end do

But this order makes worse use of the memory hierarchy (poorer spatial locality).
Spatial Locality

    do i = 1, n
       do j = 1, n
          a(i, j) = 0.0
       end do
    end do

This accesses a(1,1), a(1,2), ..., a(2,1), a(2,2), ... BUT in Fortran arrays are stored in column-wise order: memory holds a(1,1), a(2,1), ... Successive iterations of the inner loop do not access successive locations in memory, and by the time we access a(2,1), n iterations later, it may already have been evicted from the cache.

    do j = 1, n
       do i = 1, n
          a(i, j) = 0.0
       end do
    end do

Here successive references in time are adjacent in memory: full exploitation of spatial locality.

Quiz 1

    !$omp parallel
    do i = 1, 10
       print *, 'Hello world', i
    end do
    !$omp end parallel

If you have 4 threads, how many times is "Hello world" printed?
Quiz 2

    !$omp parallel do
    do i = 1, 10
       print *, 'Hello world', i
    end do
    !$omp end parallel do

If you have 4 threads, how many times is "Hello world" printed?

Static vs. Dynamic Schedule

Loop scheduling may be static or dynamic:
- Static schedule: the choice of which thread performs a particular iteration is purely a function of the iteration number and the number of threads. Each thread performs only the iterations assigned to it at the beginning of the loop. Risk: load imbalance.
- Dynamic schedule: the assignment of iterations to threads can vary at runtime from one execution to another. Not all iterations are assigned to threads at the beginning of the loop; each thread requests more iterations after it has completed the work already assigned to it. Cost: synchronization.
Scheduling Syntax

    SCHEDULE(type [, chunk])

- type is STATIC, DYNAMIC, GUIDED, or RUNTIME
- chunk is the number of iterations a chunk contains

Do Scheduling: STATIC

    !$OMP PARALLEL DO &
    !$OMP SCHEDULE(STATIC,3)
    DO J = 1, 36
       CALL work(j)
    END DO
    !$OMP END PARALLEL DO

1. Iterations are divided into chunks of size 3.
2. The chunks are statically assigned to the threads: the first thread gets the first chunk, the second thread gets the second chunk, and so on.
3. If no chunk size is specified, the iterations are divided as evenly as possible among the threads.

From http://www.msi.umn.edu/tutorial/scicomp/general/openmp/
Do Scheduling: DYNAMIC

    !$OMP PARALLEL DO &
    !$OMP SCHEDULE(DYNAMIC,1)
    DO J = 1, 36
       CALL work(j)
    END DO
    !$OMP END PARALLEL DO

1. Iterations are divided into chunks of size 1.
2. The chunks are dynamically assigned to the threads at runtime: each thread grabs the next available chunk as it finishes its current one.
3. If no chunk size is specified, the default is 1.

Do Scheduling: GUIDED

    !$OMP PARALLEL DO &
    !$OMP SCHEDULE(GUIDED)
    DO J = 1, 36
       CALL work(j)
    END DO
    !$OMP END PARALLEL DO

1. The first chunk is of some implementation-dependent size, typically about N/P.
2. The sizes of successive chunks decrease exponentially down to a minimum chunk size (e.g., 4, 2, 1).
3. The chunks are assigned to threads dynamically.
4. If no minimum chunk size is specified, the default is 1.
Programming Shared Memory Systems with OpenMP, Part III
Instructor: Dr. Taufer

Data Parallelism

    #include <stdio.h>

    int main() {
        int i, k = 0, k1;
        #pragma omp parallel shared(k) private(i, k1)
        {
            k1 = 0;
            #pragma omp for
            for (i = 1; i <= 1000; i++)
                k1 += 1;
            #pragma omp critical
            k += k1;
        }
        printf("%d\n", k);
        return 0;
    }

Can you explain why this is an example of data parallelism? Hint: consider OMP_NUM_THREADS = 4 to explain the code.
Task (or Functional) Parallelism

    #pragma omp parallel sections
    {
        #pragma omp section
        /* block 1 */
        #pragma omp section
        /* block 2 */
    }

Example: Task Parallelism

    v = alpha();
    w = beta();
    x = gamma(v, w);
    y = delta();
    printf("%6.2f\n", epsilon(x, y));

[Task graph: alpha and beta feed gamma; gamma and delta feed epsilon.]
Version 1:

    #pragma omp parallel sections
    {
        #pragma omp section
        v = alpha();
        #pragma omp section
        w = beta();
        #pragma omp section
        y = delta();
    }
    x = gamma(v, w);
    printf("%6.2f\n", epsilon(x, y));

Version 2:

    #pragma omp parallel sections
    {
        #pragma omp section
        v = alpha();
        #pragma omp section
        w = beta();
    }
    #pragma omp parallel sections
    {
        #pragma omp section
        x = gamma(v, w);
        #pragma omp section
        y = delta();
    }
    printf("%6.2f\n", epsilon(x, y));

Which version is better? Hint: assume that you have a dual processor and explain the parallel execution of the two versions.

    void do_physics() {
        #pragma omp parallel sections
        {
            #pragma omp section
            top_physics();
            #pragma omp section
            bottom_physics();
            #pragma omp section
            left_physics();
            #pragma omp section
            right_physics();
            #pragma omp section
            front_physics();
            #pragma omp section
            rear_physics();
        }
    }

Can you describe the code behavior? Assume that you have a machine with 6 processors. This program is free to completely overlap the computation of the subroutines by distributing them among the threads in the team.
    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        #pragma omp single
        printf("%d: Starting process_block1\n", tid);
        process_block1();
        #pragma omp single nowait
        printf("%d: Starting process_block2\n", tid);
        process_block2();
        #pragma omp single
        printf("%d: All done\n", tid);
    }

Can you describe the code behavior? How many print statements do you have per single section? What is the function of nowait? Assume that you have a dual processor.

Within the parallel region each print statement is printed only once, no matter how many threads execute the statements in the parallel region. There is an implied barrier at the end of a single construct. As a result, after one thread executes the print statement, all other threads must "catch up" to the barrier point before they all execute the next statements. The nowait clause can be used to eliminate the implied barrier.

    #pragma omp parallel shared(request_queue) private(request_id, request_status)
    for (;;) {
        #pragma omp critical (get_request)
        request_id = get_next_request(request_queue);
        printf("processing request %d\n", request_id);
        request_status = process_request(request_id);
        update_request_status(request_id, request_status);
    }

get_next_request is called by only one thread at a time, ensuring that each thread receives a unique request identifier. The critical construct is contained within a parallel construct that identifies request_queue as a shared variable and request_id and request_status as variables private to each thread.
    #pragma omp parallel
    {
        work_phase1();
        #pragma omp barrier
        exchange_results();
        work_phase2();
    }

work_phase1() is executed simultaneously by all threads in the team. As each thread returns from the routine, it waits for all threads to complete work_phase1() before calling exchange_results() and executing work_phase2(). In general, barriers should be avoided except where necessary to preserve the integrity of the data environment: spending valuable time synchronizing threads that could operate completely independently is not a good use of computer time.

Deadlines

3/10  Lecture: OpenMP. Discussion: homework 1 (student randomly selected).
3/12  Lecture: OpenMP.
3/17  Discussion: OpenMP articles (students randomly selected). Deadline 2. Seminar presentation. Student 1: Wei Yi [17].
3/19  Seminar presentations. Student 1: Yuanfang Chen [13]. Student 2: Adnan Ozsoy [7].
3/24  Lecture: MD. Homework 2.
3/26  Lecture: MD. Discussion: homework 2 (student randomly selected).
3/31  Semester break, no class. Deadline 3.