EE/CSCI 451: Parallel and Distributed Computation


1 EE/CSCI 451: Parallel and Distributed Computation Lecture #15 3/7/2017 Xuehai Qian University of Southern California 1

2 Outline From last class: Data distribution; Mapping; Parallel algorithm models Today (Chapter 6): Message passing; Send and receive operations; Examples, performance issues 2

3 Message Passing Programming Model (1) Message passing One of the oldest parallel programming paradigms Widely used Key features Partitioned address space local data, remote data Explicit parallelization the user is responsible for specifying and managing concurrency Can be challenging 3

4 Message Passing Programming Model (2) Explicit communication Program 0 Program 1 Program p-1 Data local to program 0 Data local to program 1 Data local to program p-1 Program address space partitioned across the programs Communication - needs coordination among the communicating processes (and the host for the two processes) 4

5 Message Passing Program (1) Most general model: asynchronous [Figure: programs 0 through p−1, each issuing interleaved sends (S) and receives (R) before ending] No structure with respect to instructions, interactions No global clock Execution is asynchronous Programs 0, 1, …, p−1 can all be distinct Hard to write/debug 5

6 Message Passing Program (2) Loosely synchronous [Figure: programs 0 through 4 alternating local computation with common synchronization points at which data is received] Some structure Easier to reason about than the asynchronous execution model 6

7 Message Passing Program (3) SPMD (Single Program Multiple Data) Code is the same in all the processes except for initialization Restrictive model, easy to write and debug Widely used Correctness in all 3 models of concurrency: irrespective of the rate of execution of each program, the computation should produce the correct results for every input, as intended 7

8 Message Passing Program Specification User specifies: Processes Process layout Data layout Layouts: 1-D (processes 0, 1, …, p−1) or 2-D (processes (0,0) through (√p−1, √p−1)) Embedding onto the target platform: specified by the user, or the MPI system software finds the most appropriate mapping that reduces the cost of sending and receiving messages 8
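
Below is a minimal C/MPI sketch (not from the slides) of letting the MPI library choose and report a 2-D process layout through its Cartesian topology routines; the grid dimensions, periodicity, and the reorder flag used here are illustrative assumptions.

#include <mpi.h>
#include <stdio.h>

/* Sketch (illustrative): ask MPI for a 2-D process layout via a Cartesian
   topology; the library may remap ranks to reduce communication cost. */
int main(int argc, char *argv[]) {
    int p, grid_rank, dims[2] = {0, 0}, periods[2] = {1, 1}, coords[2];
    MPI_Comm grid;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    MPI_Dims_create(p, 2, dims);              /* factor p into a near-square grid */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1 /* reorder */, &grid);
    MPI_Comm_rank(grid, &grid_rank);          /* rank may differ after reordering */
    MPI_Cart_coords(grid, grid_rank, 2, coords);

    printf("process %d -> grid position (%d,%d) of %d x %d\n",
           grid_rank, coords[0], coords[1], dims[0], dims[1]);
    MPI_Finalize();
    return 0;
}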

9 Send and Receive (1) Send and Receive operations Send(sendbuf, size, dest), where dest is the destination process ID Receive(recvbuf, size, source), where source is the source process ID Example: send data from process 0 to process 1 Sent data = data at the beginning of the execution of the send Send and receive should be matched (e.g., using process IDs) Complications may arise due to the way the software and hardware implement the operation 9
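
The following is a minimal C/MPI sketch of the matched Send/Receive pair described above; the buffer contents, message size, and tag are illustrative.

#include <mpi.h>
#include <stdio.h>

/* Sketch (illustrative): process 0 sends a small buffer to process 1. */
int main(int argc, char *argv[]) {
    int rank, data[4] = {1, 2, 3, 4};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send(sendbuf, size, dest): dest is the destination process ID */
        MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive(recvbuf, size, source): source is the sending process ID */
        MPI_Recv(data, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d %d %d %d\n",
               data[0], data[1], data[2], data[3]);
    }
    MPI_Finalize();
    return 0;
}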

10 Send and Receive (2) What data is sent? Buffered? Issues Sending process: wait until completion of communication? Overheads at sender, at receiver 10

11 Adding Using Message Passing (1) Start with adding on PRAM Input array: A(0), …, A(n−1) Output = A(0) + A(1) + … + A(n−1), stored in A(0) 11

12 Adding Using Message Passing (2) PRAM Algorithm Program in processor j, 0 ≤ j ≤ n−1 1. Do i = 0 to log2 n − 1 2. If j = k · 2^(i+1) for some k ∈ N: A(j) ← A(j) + A(j + 2^i) 3. End Note: A(0), …, A(n−1) is shared among all the processors Synchronous operation [e.g., all the processors execute instruction 2 during the same cycle; log2 n cycles] N = set of natural numbers = {0, 1, …} Parallel time = O(log n) cycles 12
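
As a rough shared-memory analogue of this PRAM pseudocode, the C/OpenMP sketch below emulates the lockstep cycles with one thread per element and a barrier after each step; the array size and data are placeholders, and it assumes the runtime actually grants all N threads.

#include <omp.h>
#include <stdio.h>

#define N 8   /* n, assumed here to be a power of two */

/* Sketch (illustrative): emulate the synchronous PRAM reduction with one
   OpenMP thread per element; a barrier stands in for each lockstep cycle. */
int main(void) {
    int A[N] = {1, 2, 3, 4, 5, 6, 7, 8};

    #pragma omp parallel num_threads(N)
    {
        int j = omp_get_thread_num();
        for (int step = 1; step < N; step *= 2) {   /* step = 2^i            */
            if (j % (2 * step) == 0)                /* j = k * 2^(i+1)       */
                A[j] = A[j] + A[j + step];          /* A(j) += A(j + 2^i)    */
            #pragma omp barrier                     /* end of cycle i        */
        }
    }
    printf("sum = %d (in A(0))\n", A[0]);
    return 0;
}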

13 Adding Using Message Passing (3) Message Passing Algorithm (SPMD model) Program in process j, 0 ≤ j ≤ n−1 1. Do i = 0 to log2 n − 1 2. If j = k · 2^(i+1) + 2^i for some k ∈ N: Send A(j) to process j − 2^i 3. Else if j = k · 2^(i+1) for some k ∈ N: Receive A(j + 2^i) from process j + 2^i; A(j) ← A(j) + A(j + 2^i) 4. Barrier 5. End Note: A(j) is local to process j N = set of natural numbers = {0, 1, …} Parallel time = O(log n) iterations 13
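
A possible C/MPI rendering of this SPMD reduction, with one array element per process, is sketched below; the initial values are placeholders and the explicit barrier simply mirrors the pseudocode above.

#include <mpi.h>
#include <stdio.h>

/* Sketch (illustrative): SPMD reduction with one element per process;
   A(j) is local to process j, and the result accumulates in process 0. */
int main(int argc, char *argv[]) {
    int j, p, a;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &j);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    a = j + 1;                                   /* placeholder value of A(j)   */
    for (int step = 1; step < p; step *= 2) {    /* step = 2^i                  */
        if (j % (2 * step) == step) {            /* j = k*2^(i+1) + 2^i: send   */
            MPI_Send(&a, 1, MPI_INT, j - step, 0, MPI_COMM_WORLD);
        } else if (j % (2 * step) == 0 && j + step < p) {
            int tmp;                             /* j = k*2^(i+1): receive, add */
            MPI_Recv(&tmp, 1, MPI_INT, j + step, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            a += tmp;
        }
        MPI_Barrier(MPI_COMM_WORLD);             /* the Barrier of the slide    */
    }
    if (j == 0) printf("sum = %d\n", a);
    MPI_Finalize();
    return 0;
}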

14 Adding using Message Passing (4) Communication between processes Power of 2 connections e.g. Hypercube Total amount of communication = O(n) 14

15 MM using Message Passing (1) C = A × B (Cannon's algorithm) n × n matrices, √p × √p processors P(i,j), 0 ≤ i, j < √p, 1 ≤ √p ≤ n Processor P(i,j) is assigned A(i,j), B(i,j), C(i,j): the (i,j)th blocks, each of size (n/√p) × (n/√p) 15

16 MM using Message Passing (2) [Figure: circular left shift along a row of processors 0, 1, …, √p−1, and circular up shift along a column of processors 0, …, √p−1] 16

17 MM using Message Passing (3) Initial data alignment For A: circular left shift of row i by i positions (0 ≤ i < √p) For B: circular up shift of column j by j positions (0 ≤ j < √p) Example: 4 × 4 matrix on a 4 × 4 processor array, A and B after initial alignment (processor (i,j) holds A(i,(i+j) mod 4) and B((i+j) mod 4, j)):
Row 0: A0,0 B0,0 | A0,1 B1,1 | A0,2 B2,2 | A0,3 B3,3
Row 1: A1,1 B1,0 | A1,2 B2,1 | A1,3 B3,2 | A1,0 B0,3
Row 2: A2,2 B2,0 | A2,3 B3,1 | A2,0 B0,2 | A2,1 B1,3
Row 3: A3,3 B3,0 | A3,0 B0,1 | A3,1 B1,2 | A3,2 B2,3
17

18 MM using Message Passing (4) Parallel algorithm (global view) 1. Initial data alignment 2. Repeat √p times (one super step each): Ø All processors P(i,j) perform an (n/√p) × (n/√p) matrix multiplication in parallel using local data Ø In parallel for all i, j: processor P(i,j) circular-left-shifts its block a by 1 position Ø In parallel for all i, j: processor P(i,j) circular-up-shifts its block b by 1 position End Note: a, b, c are (n/√p) × (n/√p) matrices, local to each processor Data alignment uses message passing (a permutation in each row and each column) 18

19 MM using Message Passing (5) Parallel algorithm (local view from P(i,j)) Repeat √p times (one super step each): Ø c ← c + a · b ((n/√p) × (n/√p) matrix multiplication) Ø a ← block read from the right neighbor (i, (j + 1) mod √p) Ø b ← block read from the neighbor below ((i + 1) mod √p, j) End 19
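
A sketch of this local view in C with MPI is shown below; it assumes p is a perfect square, uses a periodic Cartesian communicator and MPI_Sendrecv_replace for the circular shifts, and fills the local blocks with dummy data, so it illustrates the communication structure rather than a tuned implementation. The block size NB stands in for n/√p.

#include <mpi.h>
#include <math.h>
#include <stdlib.h>

#define NB 64   /* local block size n/sqrt(p): an assumed placeholder */

/* c += a * b on NB x NB blocks stored row-major */
static void local_mm(double *c, const double *a, const double *b) {
    for (int i = 0; i < NB; i++)
        for (int k = 0; k < NB; k++)
            for (int j = 0; j < NB; j++)
                c[i*NB + j] += a[i*NB + k] * b[k*NB + j];
}

/* Sketch (illustrative) of Cannon's algorithm: assumes p is a perfect square;
   local blocks hold dummy data, and MPI_Sendrecv_replace performs the shifts. */
int main(int argc, char *argv[]) {
    int p, rank, q, dims[2], periods[2] = {1, 1}, coords[2];
    int src, dst, left, right, up, down;
    MPI_Comm grid;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    q = (int)sqrt((double)p);
    while ((q + 1) * (q + 1) <= p) q++;               /* q = floor(sqrt(p))    */
    dims[0] = dims[1] = q;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);
    if (grid == MPI_COMM_NULL) { MPI_Finalize(); return 0; }
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);

    double *a = calloc(NB*NB, sizeof *a);
    double *b = calloc(NB*NB, sizeof *b);
    double *c = calloc(NB*NB, sizeof *c);
    for (int i = 0; i < NB*NB; i++) { a[i] = 1.0; b[i] = 1.0; }  /* dummy data */

    /* Initial alignment: row i shifts A left by i, column j shifts B up by j */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(a, NB*NB, MPI_DOUBLE, dst, 0, src, 0, grid,
                         MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(b, NB*NB, MPI_DOUBLE, dst, 0, src, 0, grid,
                         MPI_STATUS_IGNORE);

    /* Neighbors for the per-super-step shifts by one position */
    MPI_Cart_shift(grid, 1, -1, &right, &left);       /* A: shift left by 1   */
    MPI_Cart_shift(grid, 0, -1, &down, &up);          /* B: shift up by 1     */

    for (int step = 0; step < q; step++) {            /* sqrt(p) super steps  */
        local_mm(c, a, b);                            /* compute on local data */
        MPI_Sendrecv_replace(a, NB*NB, MPI_DOUBLE, left, 0, right, 0, grid,
                             MPI_STATUS_IGNORE);      /* a <- from right neighbor */
        MPI_Sendrecv_replace(b, NB*NB, MPI_DOUBLE, up, 0, down, 0, grid,
                             MPI_STATUS_IGNORE);      /* b <- from neighbor below */
    }

    free(a); free(b); free(c);
    MPI_Finalize();
    return 0;
}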

20 MM using Message Passing (6) Illustration (4 × 4 matrix, 4 × 4 processor array), Cannon's algorithm [Figure: A and B block positions on the processor array after the initial alignment] Initial alignment; Super step 0: compute using local data, circular left shift A, circular up shift B 20

21 MM using Message Passing (7) Cannon's algorithm [Figure: A and B block positions after the super step 0 shifts] Super step 1: compute using local data, circular left shift A, circular up shift B 21

22 MM using Message Passing (8) Cannon's algorithm [Figure: A and B block positions after the super step 1 shifts] Super step 2: compute using local data, circular left shift A, circular up shift B 22

23 MM using Message Passing (9) Cannon's algorithm [Figure: A and B block positions after the super step 2 shifts] Super step 3: compute using local data 23

24 MM using Message Passing (10) Performance analysis Operations per super step in each PE: (n/√p)^3 multiplications and (n/√p)^3 additions Total number of super steps: √p Total number of operations (over all the PEs): (n/√p)^3 · √p super steps · p processors = n^3 multiplications, and likewise n^3 additions Total amount of data communicated (data received) = √p super steps · 2(n/√p)^2 words per process per super step · p processes = O(n^2 · √p) 24

25 MM using Shared Variable (1) C = A × B, n × n Each thread (i, j) is responsible for updating C(i, j), 0 ≤ i, j < n A and B are shared variables 25

26 MM using Shared Variable (2) Thread (i, j): C(i, j) ← 0 Do k from 0 to n − 1: C(i, j) ← C(i, j) + A(i, k) · B(k, j) End [Figure: threads running on top of a shared memory] 26
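
A C/OpenMP sketch of this shared-variable multiply follows; the matrix size and contents are placeholders, and the (i, j) thread of the slide becomes one iteration of a parallel loop nest.

#include <omp.h>
#include <stdio.h>

#define N 256   /* placeholder problem size */

static double A[N][N], B[N][N], C[N][N];   /* shared among all threads */

/* Sketch (illustrative): each (i, j) iteration plays the role of thread (i, j)
   on the slide and exclusively owns C[i][j], so no locking is needed. */
int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 1.0; }

    #pragma omp parallel for collapse(2)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }

    printf("C[0][0] = %.1f (expected %d)\n", C[0][0], N);
    return 0;
}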

27 Blocking Send/Receive Blocking semantics Data sent = data at the time the Send command was initiated To ensure correctness, the send operation blocks until some condition guaranteeing these semantics is met Blocking non-buffered send Block the sending process Send a request to the receiving process Wait for the receiving process to acknowledge (matched receive operation) Upon receiving the acknowledgement, start the transfer No buffers 27

28 Blocking Send/Receive Idling overheads [Figure: three request-to-send / okay-to-send / data timelines between a sending and a receiving process] (a) Sender comes first; idling at the sender (b) Sender and receiver come at about the same time; idling is minimized (c) Receiver comes first; idling at the receiver 28

29 Deadlock (1) Example (1) both processes blocked: P0: 1. send(&a, 1, 1); 2. receive(&b, 1, 1); P1: 1. send(&a, 1, 0); 2. receive(&b, 1, 0); With blocking sends, each process waits for the other's receive, which is never reached: deadlock Deadlocks arise very easily with blocking protocols 29

30 Deadlock (2) Example (2) avoiding the cyclic wait by ordering the operations: If myid is even: Send, then Receive If myid is odd: Receive, then Send E.g., P0: Send; Receive, while P1: Receive; Send 30
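
A minimal C/MPI sketch of this parity-based ordering is given below; the partner choice (neighboring ranks) and the message contents are illustrative.

#include <mpi.h>

/* Sketch (illustrative): neighboring ranks exchange one value; even ranks
   send first, odd ranks receive first, so no cyclic wait can form. */
int main(int argc, char *argv[]) {
    int myid, p, partner, a, b = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    a = myid;
    partner = myid ^ 1;                     /* pair rank 0 with 1, 2 with 3, ... */
    if (partner < p) {
        if (myid % 2 == 0) {                /* even: Send, then Receive */
            MPI_Send(&a, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
            MPI_Recv(&b, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {                            /* odd: Receive, then Send  */
            MPI_Recv(&b, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&a, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}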

31 Non-blocking Send/Receive (1) Non-blocking send/receive Fast send/receive (reduce overhead) Let the programmer manage semantic correctness Send: Perform simple initiation, setup Return control immediately User should not alter data immediately after issuing send. However, user can do other (useful) operations Status information available for user to check Example: check-status 31

32 Non-blocking Send/Receive (2) Non-blocking send/receive [Figure: the sending process issues a request to send, copies data into a buffer, and continues execution; until the copy finishes it is unsafe to update the sent data, afterwards it is safe; the receiving process posts its receive and the data is transferred after the okay-to-send handshake] 32
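
The sketch below shows the non-blocking pattern in C/MPI, including a status check before waiting; the message size, tag, and the placeholder for overlapped work are assumptions.

#include <mpi.h>
#include <stdio.h>

/* Sketch (illustrative): non-blocking send/receive with a completion check.
   The sender must not modify 'data' until the request completes. */
int main(int argc, char *argv[]) {
    int rank, data[1024] = {0}, flag = 0;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        data[0] = 42;
        MPI_Isend(data, 1024, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* control returns immediately: do other useful, independent work here */
        MPI_Test(&req, &flag, MPI_STATUS_IGNORE);   /* check-status             */
        MPI_Wait(&req, MPI_STATUS_IGNORE);          /* now safe to reuse 'data' */
        data[0] = 0;
    } else if (rank == 1) {
        MPI_Irecv(data, 1024, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);          /* data valid only after Wait */
        printf("received %d\n", data[0]);
    }
    MPI_Finalize();
    return 0;
}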

33 Summary of Blocking Send Non-buffered Data sent = data at the time the Send command was initiated Issue a send request and block the sending process Start the data transfer after receiving an acknowledgement from the receiving process Return control to the sending process after communication completion, e.g., the receiving process has received the entire data 33

34 Summary of Non-Blocking Send Data sent = data at the time the Send command was initiated Copy data into send buffer then return control to sending process immediately User can alter sent data after they have been copied into buffer 34

35 Additional Materials in Textbook These are not required for this class Non-blocking non-buffered send/receive operations with communication hardware support Non-blocking non-buffered send/receive operations without communication hardware support 35

36 OpenMP or MPI? [Figure: MPI used between nodes across the interconnection network; OpenMP used within a multicore shared-memory node] 36
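
A minimal hybrid sketch in C is shown below: MPI between processes (typically one per node) and OpenMP threads within each process; the requested thread-support level is an illustrative choice.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Sketch (illustrative): MPI between processes (typically one per node),
   OpenMP threads inside each process on the node's shared memory. */
int main(int argc, char *argv[]) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }
    MPI_Finalize();
    return 0;
}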

37 Summary Send and Receive operations Blocking / non-blocking Issues Overhead Performance Correctness Deadlock 37

38 Backup Slides 38

39 Protocols For Send/Receive
Buffered, blocking: sending process returns after data has been copied into the communication buffer
Buffered, non-blocking: sending process returns after initiating a DMA transfer to the buffer; the operation may not be complete on return
Non-buffered, blocking: sending process blocks until the matching receive operation has been encountered
Blocking operations: Send and Receive semantics are assured by the corresponding operation
Non-blocking operations: the programmer must explicitly ensure semantics by polling to verify completion
39
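
As one concrete instance of the buffered blocking row of this table, the C/MPI sketch below uses MPI_Bsend with a user-attached buffer; the buffer size and message are illustrative.

#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

/* Sketch (illustrative): buffered blocking send. MPI_Bsend returns as soon as
   the message is copied into the user-attached buffer, even if the matching
   receive has not been posted yet. */
int main(int argc, char *argv[]) {
    int rank, msg = 7, bufsize;
    char *buf;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
        buf = malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);
        MPI_Bsend(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        msg = 0;                              /* safe: data already buffered      */
        MPI_Buffer_detach(&buf, &bufsize);    /* waits until buffered sends drain */
        free(buf);
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d\n", msg);
    }
    MPI_Finalize();
    return 0;
}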
