Synchronization. Event Synchronization


Synchronization

Synchronization: the mechanisms by which a parallel program coordinates the execution of multiple threads.
- Implicit synchronizations
- Explicit synchronizations

The main use of explicit synchronization is to control access to shared objects:
- Mutual exclusion
- Event synchronization

Event Synchronization

Event synchronization constructs signal the occurrence of an event across multiple threads, e.g.:
- Barrier
- Master
- Ordered
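As a minimal C sketch (not part of the original slides), the following contrasts the two uses of explicit synchronization: a critical construct for mutual exclusion on a shared counter, and a barrier as an event-synchronization point. The variable name count is arbitrary.

#include <stdio.h>

int main(void) {
    int count = 0;                         /* shared counter */
    #pragma omp parallel
    {
        /* Mutual exclusion: only one thread at a time updates count */
        #pragma omp critical
        count++;

        /* Event synchronization: every thread waits here until all arrive */
        #pragma omp barrier

        /* Only the master thread executes this statement */
        #pragma omp master
        printf("all %d threads have checked in\n", count);
    }
    return 0;
}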

Barrier Directive

Threads in a team wait until the entire team reaches the barrier.

!$OMP PARALLEL
!$OMP DO REDUCTION(+:S)
      DO I = 1, 100
         S = S + F(I)
      END DO
!$OMP END DO NOWAIT
      ...
!     Wait for all the threads to reach this point
!$OMP BARRIER
      PRINT *, S
!$OMP END PARALLEL

Master Directive

Only the master thread executes the enclosed block of code.

!$OMP PARALLEL
!$OMP DO
      DO i = 1, n
         ! complex calculations here
      END DO
!$OMP MASTER
      PRINT *, 'intermediate results'
!$OMP END MASTER
      ! continue next calculations
      ...
!$OMP END PARALLEL
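For readers working in C, here is a hedged sketch of the same barrier pattern; the function f() and the loop bound are placeholders, and the nowait clause plays the role of END DO NOWAIT above.

#include <stdio.h>

double f(int i) { return (double)i; }      /* placeholder for the real work */

int main(void) {
    double s = 0.0;
    #pragma omp parallel
    {
        #pragma omp for reduction(+:s) nowait
        for (int i = 1; i <= 100; i++)
            s += f(i);

        /* ... other independent work could go here ... */

        #pragma omp barrier                /* wait for all threads, like !$OMP BARRIER */
        #pragma omp master
        printf("s = %f\n", s);
    }
    return 0;
}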

Ordered Directive

The portion of code within a loop iteration enclosed in an ordered section must be executed in the original, sequential order of the loop iterations.

!$OMP PARALLEL DO ORDERED
      DO i = 1, n
         a(i) = ... ! complex calculations here
!        Wait until the previous iteration has
!        finished its ordered section
!$OMP ORDERED
         PRINT *, a(i)
!        Signal the completion of the ordered
!        section from this iteration
!$OMP END ORDERED
      END DO

Parallel Overhead

- The master thread has to start the slaves
- Iterations have to be divided among threads
- Threads must synchronize at the end of the loop

Each iteration of a loop involves a certain amount of work, e.g.:
- Integer and floating-point operations
- Loads and stores of memory locations
- Control-flow instructions such as subroutine calls and branches
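A minimal C sketch of the same ordered pattern, assuming a placeholder compute() function and an arbitrary array size; the ordered clause on the loop directive and the ordered construct inside the body mirror the Fortran example.

#include <stdio.h>

double compute(int i) { return 0.5 * i; }   /* placeholder for the complex calculations */

int main(void) {
    enum { N = 16 };
    double a[N + 1];

    #pragma omp parallel for ordered
    for (int i = 1; i <= N; i++) {
        a[i] = compute(i);                  /* may run in any order across threads */
        #pragma omp ordered
        printf("%d %f\n", i, a[i]);         /* printed in the original iteration order */
    }
    return 0;
}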

Reducing Overhead: If Clause

      if (n .gt. 800) then
!$omp parallel do
         do i = 1, n
            z(i) = a * x(i) + y
         enddo
      else
         do i = 1, n
            z(i) = a * x(i) + y
         enddo
      endif

The same effect is obtained more compactly with the if clause:

!$omp parallel do if (n .gt. 800)
      do i = 1, n
         z(i) = a * x(i) + y
      enddo

Avoid parallel overhead at low trip counts.

Reducing Overhead: Loop Interchange

      do j = 2, n                     ! outer loop not parallelizable: data dependency on j
!$omp parallel do
         do i = 1, n                  ! inner loop parallelizable
            a(i, j) = a(i, j) + a(i, j - 1)
         enddo
      enddo

Reduce parallel overhead through loop interchange, so the parallel region is entered only once:

!$omp parallel do
      do i = 1, n
         do j = 2, n
            a(i, j) = a(i, j) + a(i, j - 1)
         enddo
      enddo

But this gives worse utilization of the memory (see Spatial Locality below).
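A hedged C sketch of the if clause, wrapped in a hypothetical scaled_add() helper that is not on the slides; the threshold of 800 is the one used above and is machine-dependent.

/* Parallelize the loop only when the trip count n is large enough
 * to amortize the parallel overhead. */
void scaled_add(int n, double a, const double *x, double y, double *z) {
    #pragma omp parallel for if (n > 800)
    for (int i = 0; i < n; i++)
        z[i] = a * x[i] + y;
}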

Spatial Locality

      do i = 1, n
         do j = 1, n
            a(i, j) = 0.0
         enddo
      enddo

This accesses a(1,1), a(1,2), ..., a(2,1), a(2,2), ... BUT in Fortran arrays are stored in column-wise order: memory holds a(1,1), a(2,1), ... Successive iterations of the inner loop do not access successive locations in memory, and by the time we access a(2,1), n iterations later, it may already have been evicted from the cache.

      do j = 1, n
         do i = 1, n
            a(i, j) = 0.0
         enddo
      enddo

Here successive references in time are adjacent in memory: full exploitation of spatial locality.

Quiz 1

!$omp parallel
      do i = 1, 10
         print *, 'Hello world', i
      end do
!$omp end parallel

If you have 4 threads, how many times is "Hello world" printed?
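A hedged side note, not from the slides: the same argument applies in C with the storage order reversed, since C arrays are row-major. The cache-friendly order therefore puts the column index in the inner loop, as in this minimal sketch (array name and size are arbitrary):

enum { N = 1024 };
static double a[N][N];

/* Good spatial locality in C: the inner loop over j walks memory contiguously,
 * because C stores a[i][0], a[i][1], ... in consecutive locations (row-major). */
void zero_rowwise(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 0.0;
}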

Quiz 2

!$omp parallel do
      do i = 1, 10
         print *, 'Hello world', i
      end do
!$omp end parallel do

If you have 4 threads, how many times is "Hello world" printed?

Static vs. Dynamic Schedule

Loop scheduling may be static or dynamic.

Static schedule: the choice of which thread performs a particular iteration is purely a function of the iteration number and the number of threads. Each thread performs only the iterations assigned to it at the beginning of the loop; this can lead to load imbalances.

Dynamic schedule: the assignment of iterations to threads can vary at runtime from one execution to another. Not all iterations are assigned to threads at the beginning of the loop; each thread requests more iterations after it has completed the work already assigned to it. This incurs a synchronization cost.
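As a hedged C illustration (not from the slides), the sketch below prints which thread executes each iteration; switching the clause between schedule(static, 3) and schedule(dynamic, 1) makes the difference between the two policies visible.

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Iterations are handed out in chunks of 3, fixed up front (static). */
    #pragma omp parallel for schedule(static, 3)
    for (int j = 1; j <= 36; j++)
        printf("iteration %2d -> thread %d\n", j, omp_get_thread_num());
    return 0;
}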

Scheduling Syntax

schedule(type [, chunk])

- type is static, dynamic, guided, or runtime
- chunk is the number of iterations a chunk contains

Do Scheduling: Static

!$OMP PARALLEL DO &
!$OMP SCHEDULE(STATIC,3)
      DO J = 1, 36
         CALL work(J)
      END DO
!$OMP END DO

1. Iterations are divided into chunks of size 3
2. The chunks are statically assigned to the threads: the first thread gets the first chunk, the second thread gets the second chunk, and so on
3. If the chunk size is not specified, the iterations are divided into contiguous chunks of approximately equal size, one per thread

From http://www.msi.umn.edu/tutorial/scicomp/general/openmp/

Do Scheduling: Dynamic

!$OMP PARALLEL DO &
!$OMP SCHEDULE(DYNAMIC,1)
      DO J = 1, 36
         CALL work(J)
      END DO
!$OMP END DO

1. Iterations are divided into chunks of size 1
2. The chunks are dynamically assigned to the threads at runtime: each thread takes the next available chunk as soon as it finishes its current one
3. If the chunk size is not specified, the default is 1

Do Scheduling: Guided

!$OMP PARALLEL DO &
!$OMP SCHEDULE(GUIDED)
      DO J = 1, 36
         CALL work(J)
      END DO
!$OMP END DO

1. The first chunk is of some implementation-dependent size
2. A typical initial chunk size is N/P (iterations divided by threads)
3. The size of successive chunks decreases exponentially down to a minimum chunk size (e.g., 4, 2, 1)
4. The chunks are assigned to threads dynamically
5. If the minimum chunk size is not specified, the default is 1

From http://www.msi.umn.edu/tutorial/scicomp/general/openmp/
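The remaining type, runtime, is not illustrated on the slides. As a hedged C sketch, it defers the choice of policy to the OMP_SCHEDULE environment variable, so the schedule can be changed without recompiling:

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* With schedule(runtime) the policy is taken from OMP_SCHEDULE, e.g.
     *   export OMP_SCHEDULE="guided,4"   (or "static,3", "dynamic,1", ...) */
    #pragma omp parallel for schedule(runtime)
    for (int j = 1; j <= 36; j++)
        printf("iteration %2d -> thread %d\n", j, omp_get_thread_num());
    return 0;
}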

Programming Shared Memory Systems with OpenMP, Part III
Instructor: Dr. Taufer

Data Parallelism

#include <stdio.h>

int main()
{
    int i, k = 0, k1;
    #pragma omp parallel shared(k) private(i, k1)
    {
        k1 = 0;
        #pragma omp for
        for (i = 1; i <= 1000; i++)
            k1 += 1;
        #pragma omp critical
        k += k1;
    }
    printf("%d\n", k);
    return 0;
}

Can you explain why this is an example of data parallelism?
Help: Consider OMP_NUM_THREADS = 4 to explain the code.
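A hedged aside, not from the slides: the same sum can be expressed more compactly with a reduction clause (introduced with the barrier example earlier), which lets the runtime manage the per-thread partial sums and their final combination:

#include <stdio.h>

int main(void)
{
    int k = 0;
    /* Each thread gets a private copy of k initialized to 0; the copies are
     * combined with + when the loop ends, so no critical section is needed. */
    #pragma omp parallel for reduction(+:k)
    for (int i = 1; i <= 1000; i++)
        k += 1;
    printf("%d\n", k);     /* prints 1000 regardless of OMP_NUM_THREADS */
    return 0;
}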

Task (or Functional) Parallelism

#pragma omp parallel sections
{
    #pragma omp section
    /* block 1 */
    #pragma omp section
    /* block 2 */
}

Example: Task Parallelism

v = alpha();
w = beta();
x = gamma(v, w);
y = delta();
printf("%6.2f\n", epsilon(x, y));

(Dependence graph from the slide: alpha and beta feed gamma; gamma and delta feed epsilon, so alpha, beta, and delta are mutually independent.)

Version 1:

#pragma omp parallel sections
{
    #pragma omp section
    v = alpha();
    #pragma omp section
    w = beta();
    #pragma omp section
    y = delta();
}
x = gamma(v, w);
printf("%6.2f\n", epsilon(x, y));

Version 2:

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        v = alpha();
        #pragma omp section
        w = beta();
    }
    #pragma omp sections
    {
        #pragma omp section
        x = gamma(v, w);
        #pragma omp section
        y = delta();
    }
}
printf("%6.2f\n", epsilon(x, y));

Which version is better?
Help: Assume that you have a dual processor and explain the parallel execution for the two versions.

void do_physics()
{
    #pragma omp parallel sections
    {
        #pragma omp section
        top_physics();
        #pragma omp section
        bottom_physics();
        #pragma omp section
        left_physics();
        #pragma omp section
        right_physics();
        #pragma omp section
        front_physics();
        #pragma omp section
        rear_physics();
    }
}

Can you describe the code behavior? Assume that you have a machine with 6 processors.
This program is free to completely overlap the computation of the subroutines by distributing them among the threads in the team.

#pragma omp parallel private(tid)
{
    tid = omp_get_thread_num();
    #pragma omp single
    printf("%d: Starting process_block1\n", tid);
    process_block1();
    #pragma omp single nowait
    printf("%d: Starting process_block2\n", tid);
    process_block2();
    #pragma omp single
    printf("%d: All done\n", tid);
}

Can you describe the code behavior? How many print statements do you get per single section? What is the function of nowait? Assume that you have a dual processor.

Within the parallel region each print statement is executed only once, no matter how many threads are executing the statements in the parallel region. There is an implied barrier at the end of a single construct. As a result, after one thread executes the print statement, all other threads must "catch up" to the barrier point before they all continue with the next statements. The nowait clause can be used to eliminate the implied barrier.

#pragma omp parallel shared(request_queue) private(request_id, request_status)
{
    for (;;) {
        #pragma omp critical (get_request)
        request_id = get_next_request(request_queue);
        printf("processing request %d\n", request_id);
        request_status = process_request(request_id);
        update_request_status(request_id, request_status);
    }
}

get_next_request is called by only one thread at a time, ensuring that each thread receives a unique request identifier. The critical construct is contained within a parallel construct that declares request_queue as a shared variable and request_id and request_status as variables private to each thread.

#pragma omp parallel
{
    work_phase1();
    #pragma omp barrier
    exchange_results();
    work_phase2();
}

work_phase1() is executed simultaneously by all threads in the team. As each thread returns from the routine, it waits for all threads to complete work_phase1() prior to calling exchange_results() and executing work_phase2().

In general, barriers should be avoided except where necessary to preserve the integrity of the data environment. Spending valuable time synchronizing threads that could operate completely independently is not a good use of computer time.

Deadlines

3/10  Lecture: OpenMP. Discussion: homework 1 (student randomly selected)
3/12  Lecture: OpenMP
3/17  Discussion: OpenMP articles (students randomly selected). Deadline 2. Seminar presentation, Student 1: Wei Yi [17]
3/19  Seminar presentations, Student 1: Yuanfang Chen [13]; Student 2: Adnan Ozsoy [7]
3/24  Lecture: MD. Homework 2
3/26  Lecture: MD. Discussion: homework 2 (student randomly selected)
3/31  Semester break, no class. Deadline 3