Synchronous Computations

1 Chapter 6 slides6-1 Synchronous Computations

2 Synchronous Computations slides6-2 In a (fully) synchronous application, all the processes are synchronized at regular points. Barrier: a basic mechanism for synchronizing processes, inserted at the point in each process where it must wait. All processes can continue from this point when all the processes have reached it (or, in some implementations, when a stated number of processes have reached this point).

3 Processes reaching barrier at different times slides6-3 (Figure: processes P0, P1, P2, ..., Pp-1 on a time axis; each is active until it reaches the barrier, then waits until all processes have arrived.)

4 In message-passing systems, barriers are provided with library routines. slides6-4 (Figure: processes P0, P1, ..., Pp-1 each call Barrier(); processes wait until all reach their barrier call.)

5 slides6-5 MPI: MPI_Barrier() - barrier routine with a named communicator as its only parameter. Called by each process in the group, blocking until all members of the group have reached the barrier call and only returning then. Other message-passing libraries provide a similar barrier routine used with a named group of processes.
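As a concrete illustration, a minimal MPI program using MPI_Barrier() might look as follows (a sketch only; work() is a hypothetical stand-in for whatever computation each process performs):

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical placeholder for the computation each process performs. */
    static void work(int rank) { printf("process %d working\n", rank); }

    int main(int argc, char *argv[]) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        work(rank);                    /* each process computes independently */
        MPI_Barrier(MPI_COMM_WORLD);   /* blocks until every process in the communicator has called it */
        printf("process %d past barrier\n", rank);

        MPI_Finalize();
        return 0;
    }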

6 Barrier Implementation slides6-6 Centralized counter implementation (a linear barrier): (Figure: processes P0, P1, ..., Pp-1 each call Barrier(); a shared counter C is incremented at each call and checked against p.)

7 slides6-7 Good barrier implementations must take into account that a barrier might be used more than once in a process. It might be possible for a process to enter the barrier for a second time before previous processes have left the barrier for the first time.

8 slides6-8 Counter-based barriers often have two phases: a process enters the arrival phase and does not leave it until all processes have arrived in this phase; then processes move to the departure phase and are released. The two-phase structure handles the reentrant scenario.

9 slides6-9 Example code:
Master:
    for (i = 0; i < n; i++)    /* count slaves as they reach barrier */
        recv(Pany);
    for (i = 0; i < n; i++)    /* release slaves */
        send(Pi);
Slave processes:
    send(Pmaster);
    recv(Pmaster);

10 slides6-10 Barrier implementation in a message-passing system. (Figure: the master executes the arrival-phase loop for(i=0;i<n;i++) recv(Pany); followed by the departure-phase loop for(i=0;i<n;i++) send(Pi); each slave's barrier consists of send(Pmaster); followed by recv(Pmaster);.)
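A minimal MPI rendering of this two-phase master/slave barrier might look as follows (a sketch, assuming process 0 plays the role of the master; barrier2() is a hypothetical name, not an MPI routine):

    /* Hypothetical two-phase counter barrier: process 0 acts as master. */
    void barrier2(MPI_Comm comm) {
        int rank, p, i;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        if (rank == 0) {
            for (i = 1; i < p; i++)    /* arrival phase: count the other p-1 processes */
                MPI_Recv(NULL, 0, MPI_INT, MPI_ANY_SOURCE, 0, comm, MPI_STATUS_IGNORE);
            for (i = 1; i < p; i++)    /* departure phase: release them */
                MPI_Send(NULL, 0, MPI_INT, i, 1, comm);
        } else {
            MPI_Send(NULL, 0, MPI_INT, 0, 0, comm);                      /* announce arrival */
            MPI_Recv(NULL, 0, MPI_INT, 0, 1, comm, MPI_STATUS_IGNORE);   /* wait for release */
        }
    }

Separate message tags distinguish arrival messages (tag 0) from release messages (tag 1).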

11 Tree Implementation slides6-11 More efficient: O(log p) steps. Suppose 8 processes, P0, P1, P2, P3, P4, P5, P6, P7:
1st stage: P1 sends message to P0 (when P1 reaches its barrier); P3 sends message to P2 (when P3 reaches its barrier); P5 sends message to P4 (when P5 reaches its barrier); P7 sends message to P6 (when P7 reaches its barrier).
2nd stage: P2 sends message to P0 (when P2 and P3 have reached their barriers); P6 sends message to P4 (when P6 and P7 have reached their barriers).
3rd stage: P4 sends message to P0 (when P4, P5, P6, and P7 have reached their barriers); P0 terminates the arrival phase (when P0 reaches its barrier and has received the message from P4).
Release with a reverse tree construction.

12 Tree barrier slides6-12 (Figure: processes P0-P7; arrival at barrier, synchronizing messages passed up the tree, then departure from barrier down the reverse tree.)
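The arrival phase described above could be sketched in MPI as follows for p a power of two (an illustration only; the departure phase, not shown, would mirror it with the sends and receives reversed down the tree):

    /* Arrival phase of a tree barrier, p assumed to be a power of 2 (sketch). */
    void tree_barrier_arrival(int rank, int p, MPI_Comm comm) {
        int step;
        for (step = 1; step < p; step *= 2) {
            if (rank % (2 * step) == step) {
                /* this process has arrived: notify its partner lower in the tree, then stop */
                MPI_Send(NULL, 0, MPI_INT, rank - step, 0, comm);
                break;
            } else if (rank % (2 * step) == 0) {
                /* wait until the higher-ranked partner (and its subtree) has arrived */
                MPI_Recv(NULL, 0, MPI_INT, rank + step, 0, comm, MPI_STATUS_IGNORE);
            }
        }
    }

With p = 8 this reproduces the message pattern listed on the previous slide: P1->P0, P3->P2, P5->P4, P7->P6, then P2->P0 and P6->P4, and finally P4->P0.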

13 Butterfly Barrier slides6-13 Pairwise synchronizations at each stage (8 processes):
1st stage: P0-P1, P2-P3, P4-P5, P6-P7
2nd stage: P0-P2, P1-P3, P4-P6, P5-P7
3rd stage: P0-P4, P1-P5, P2-P6, P3-P7
(Figure: processes P0-P7 on a time axis showing the three stages of pairwise exchanges.)
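The regular structure of the butterfly makes the code short. A sketch in MPI (p assumed to be a power of two), where each process at stage s exchanges a zero-length message with the partner whose rank differs in bit s:

    /* Butterfly barrier sketch for p = power of 2. */
    void butterfly_barrier(int rank, int p, MPI_Comm comm) {
        int step;
        for (step = 1; step < p; step *= 2) {
            int partner = rank ^ step;                     /* partner differs in exactly one bit */
            MPI_Sendrecv(NULL, 0, MPI_INT, partner, 0,     /* send arrival notice ...           */
                         NULL, 0, MPI_INT, partner, 0,     /* ... and receive the partner's     */
                         comm, MPI_STATUS_IGNORE);
        }
    }

After log2(p) stages every process has, directly or indirectly, heard from every other process, so no separate departure phase is needed. The combined MPI_Sendrecv() routine used here is introduced on a later slide.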

14 Local Synchronization slides6-14 Suppose a process Pi needs to be synchronized and to exchange data with process Pi-1 and process Pi+1 before continuing:
Process Pi-1:      Process Pi:        Process Pi+1:
recv(Pi);          send(Pi-1);        recv(Pi);
send(Pi);          send(Pi+1);        send(Pi);
                   recv(Pi-1);
                   recv(Pi+1);
Not a perfect three-process barrier because process Pi-1 will only synchronize with Pi and continue as soon as Pi allows. Similarly, process Pi+1 only synchronizes with Pi.

15 slides6-15 Deadlock When a pair of processes each send to and receive from each other, deadlock may occur. Deadlock will occur if both processes perform the send first using synchronous routines (or blocking routines without sufficient buffering). Neither send will return, because each waits for a matching receive that is never reached.

16 slides6-16 A Solution Arrange for one process to receive first and then send, and the other process to send first and then receive. Example: in a linear pipeline, deadlock can be avoided by arranging for the even-numbered processes to perform their sends first and the odd-numbered processes to perform their receives first.
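A sketch of this even/odd ordering in MPI for a linear (non-circular) pipeline, where every interior process exchanges a value with both neighbours (exchange() is a hypothetical helper; rank, p and the communicator are assumed to have been set up in the usual way):

    /* Deadlock-free neighbour exchange: even ranks send first, odd ranks receive first (sketch). */
    void exchange(double x, double *left, double *right, int rank, int p, MPI_Comm comm) {
        if (rank % 2 == 0) {
            if (rank + 1 < p) {
                MPI_Send(&x, 1, MPI_DOUBLE, rank + 1, 0, comm);
                MPI_Recv(right, 1, MPI_DOUBLE, rank + 1, 0, comm, MPI_STATUS_IGNORE);
            }
            if (rank > 0) {
                MPI_Send(&x, 1, MPI_DOUBLE, rank - 1, 0, comm);
                MPI_Recv(left, 1, MPI_DOUBLE, rank - 1, 0, comm, MPI_STATUS_IGNORE);
            }
        } else {
            MPI_Recv(left, 1, MPI_DOUBLE, rank - 1, 0, comm, MPI_STATUS_IGNORE);
            MPI_Send(&x, 1, MPI_DOUBLE, rank - 1, 0, comm);
            if (rank + 1 < p) {
                MPI_Recv(right, 1, MPI_DOUBLE, rank + 1, 0, comm, MPI_STATUS_IGNORE);
                MPI_Send(&x, 1, MPI_DOUBLE, rank + 1, 0, comm);
            }
        }
    }

The ordering pairs every blocking send with a receive that its neighbour reaches without waiting on anything else, so there is no circular wait even with synchronous sends.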

17 Combined deadlock-free blocking sendrecv() routines slides6-17 Example:
Process Pi-1:      Process Pi:        Process Pi+1:
sendrecv(Pi);      sendrecv(Pi-1);    sendrecv(Pi);
                   sendrecv(Pi+1);
MPI provides MPI_Sendrecv() and MPI_Sendrecv_replace(). MPI_Sendrecv() actually has 12 parameters!
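For instance, a deadlock-free shift of data around a ring of processes can be written with a single call per process (a sketch; rank, p and comm are assumed to have been obtained in the usual way):

    /* Each process sends its value to the right neighbour and receives the
       left neighbour's value; MPI_Sendrecv handles the ordering internally. */
    int right = (rank + 1) % p;
    int left  = (rank - 1 + p) % p;
    double mine = (double)rank, from_left;
    MPI_Sendrecv(&mine,      1, MPI_DOUBLE, right, 0,   /* send: buf, count, type, dest, tag   */
                 &from_left, 1, MPI_DOUBLE, left,  0,   /* recv: buf, count, type, source, tag */
                 comm, MPI_STATUS_IGNORE);              /* plus communicator and status: 12 parameters in all */

MPI_Sendrecv_replace() does the same but reuses a single buffer for both the outgoing and the incoming message.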

18 Synchronized Computations slides6-18 Can be classified as fully synchronous or locally synchronous. In fully synchronous computations, all processes involved in the computation must be synchronized. In locally synchronous computations, processes only need to synchronize with a set of logically nearby processes, not all the processes involved in the computation.

19 slides6-19 Fully Synchronized Computation Examples Data Parallel Computations Same operation performed on different data elements simultaneously; i.e., in parallel. Particularly convenient because: Ease of programming (essentially only one program). Can scale easily to larger problem sizes. Many numeric and some non-numeric problems can be cast in a data parallel form.

20 slides6-20 Example To add the same constant to each element of an array:
    for (i = 0; i < n; i++)
        a[i] = a[i] + k;
The statement a[i] = a[i] + k; could be executed simultaneously by multiple processors, each using a different index i (0 <= i < n).

21 Data Parallel Computation slides6-21 (Figure: the instruction a[] = a[] + k; is broadcast to all processors; processor i executes a[i] = a[i] + k; on its own element, for i = 0, 1, ..., n-1.)

22 slides6-22 forall construct A special parallel construct in parallel programming languages to specify data parallel operations. Example:
    forall (i = 0; i < n; i++) {
        body
    }
states that n instances of the statements of the body can be executed simultaneously. One value of the loop variable i is valid in each instance of the body: the first instance has i = 0, the next i = 1, and so on.

23 slides6-23 To add k to each element of an array, a, we can write:
    forall (i = 0; i < n; i++)
        a[i] = a[i] + k;

24 slides6-24 Data parallel technique applied to multiprocessors and multicomputers. Example: to add k to the elements of an array:
    i = myrank;
    a[i] = a[i] + k;    /* body */
    barrier(mygroup);
where myrank is a process rank between 0 and n - 1.
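For comparison, an SPMD version of the same operation in MPI, where each of the n processes owns one element of the distributed array, might be (a sketch; the initial values are placeholders):

    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank;
        double a_local, k = 3.0;            /* this process's element and the constant */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        a_local = (double)rank;             /* placeholder initialisation of a[myrank] */
        a_local = a_local + k;              /* the data-parallel body */
        MPI_Barrier(MPI_COMM_WORLD);        /* plays the role of barrier(mygroup) above */

        MPI_Finalize();
        return 0;
    }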

25 slides6-25 Data Parallel Example - Prefix Sum Problem Given a list of numbers, x0, ..., xn-1, compute all the partial summations (i.e., x0 + x1; x0 + x1 + x2; x0 + x1 + x2 + x3; ...). For example, the prefix sums of 1, 2, 3, 4 are 1, 3, 6, 10. Can also be defined with associative operations other than addition. Widely studied, with practical applications in areas such as processor allocation, data compaction, sorting, and polynomial evaluation.

26 slides6-26 Data parallel method of adding all partial sums of 16 numbers

27 Data parallel prefix sum operation slides6-27 (Figure: 16 numbers x0 ... x15. In step 1 (j = 0) each xi with i >= 1 adds in x(i-1); in step 2 (j = 1) each xi with i >= 2 adds in x(i-2); in step 3 (j = 2) each xi with i >= 4 adds in x(i-4); in the final step (j = 3) each xi with i >= 8 adds in x(i-8).)

28 slides6-28 Sequential code:
    for (j = 0; j < log(n); j++)          /* at each step, add */
        for (i = 2^j; i < n; i++)         /* to accumulating sum */
            x[i] = x[i] + x[i - 2^j];
Parallel code:
    for (j = 0; j < log(n); j++)          /* at each step, add */
        forall (i = 0; i < n; i++)        /* to sum */
            if (i >= 2^j)
                x[i] = x[i] + x[i - 2^j];
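A plain-C rendering of the method, runnable on one processor, is sketched below. Note that the parallel forall semantics assume every addition in a step reads the values from before that step, so this sequential emulation keeps a copy of the previous step's values (16 sample numbers are used, matching the earlier figure):

    #include <stdio.h>
    #include <string.h>

    #define N 16

    int main(void) {
        double x[N], old[N];
        int i, step;
        for (i = 0; i < N; i++) x[i] = i + 1;          /* sample data: 1, 2, ..., 16 */

        for (step = 1; step < N; step *= 2) {          /* step = 2^j, j = 0, 1, ..., log2(N)-1 */
            memcpy(old, x, sizeof x);                  /* snapshot of the previous step's values */
            for (i = step; i < N; i++)                 /* these iterations are independent: the forall body */
                x[i] = old[i] + old[i - step];
        }

        for (i = 0; i < N; i++) printf("%g ", x[i]);   /* prints 1 3 6 10 ... 136 */
        printf("\n");
        return 0;
    }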
