Parallel Numerical Algorithms

1 Parallel Numerical Algorithms
[5] MPI: Message Passing Interface
Parallel Numerical Algorithms / IST / UTokyo

2 PNA16 Lecture Plan
General Topics
  1 Architecture and Performance
  2 Dependency
  3 Locality
  4 Scheduling
MIMD / Distributed Memory
  5 MPI: Message Passing Interface
  6 Collective Communication
  7 Distributed Data Structure
MIMD / Shared Memory
  8 OpenMP
  9 Cache Performance
Special Lectures
  5/30 How to use FX10 (Prof. Ohshima)
  6/6 Dynamic Parallelism (Prof. Peri)
SIMD / Shared Memory
  10 GPU and CUDA
  11 SIMD Performance

3 Memory Models
Distributed memory
Shared memory
  Uniform Memory Access (UMA)
  Non-Uniform Memory Access (NUMA)
[Figure: distributed memory (four Proc, each with its own Memory, connected by a Network); UMA (four Proc sharing one Memory); NUMA (four Proc and four Mem)]

4 Message Passing
Process 0: start; send(data, 1); recv(data, 1); end
Process 1: start; recv(data, 0); send(data, 0); send(data, 2); recv(data, 2); end
Process 2: start; send(data, 3); recv(data, 3); recv(data, 1); send(data, 1); end
Process 3: start; recv(data, 2); send(data, 2); end
Note 1: A matching send and receive pair establishes a data transfer.
Note 2: The destination (for send) or the source (for recv) is specified explicitly.

5 Local View
The program describes the processing done by each single process.
[Figure: the four per-process listings of slide 4, one program per process]

6 SPMD
Single Program Multiple Data: one program describes all local processing.
[Figure: the four per-process listings of slide 4]
Assume:
  myid represents the ID number of the process
  nproc represents the number of running processes

    if (myid % 2 == 0) {
        send(data, myid + 1);
        recv(data, myid + 1);
    } else {
        recv(data, myid - 1);
        send(data, myid - 1);
    }
    if (myid != 0 && myid != nproc - 1) {
        if (myid % 2 == 0) {
            recv(data, myid - 1);
            send(data, myid - 1);
        } else {
            send(data, myid + 1);
            recv(data, myid + 1);
        }
    }
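The branch structure of the SPMD program can be checked without running any communication. The following is a minimal sketch, not part of the slides: the helper names phase1_partner and phase2_partner are hypothetical, and returning -1 for "no partner" is an assumption added here (the slide code implicitly assumes an even nproc).

```c
#include <assert.h>

/* Hypothetical helpers (not from the slides): they only compute WHO a rank
 * talks to in each of the two exchanges of the SPMD code; no data is sent. */

/* Partner in the first exchange: even ranks pair with myid + 1,
 * odd ranks with myid - 1. Returns -1 if that rank does not exist. */
int phase1_partner(int myid, int nproc) {
    assert(nproc > 0 && myid >= 0 && myid < nproc);
    int p = (myid % 2 == 0) ? myid + 1 : myid - 1;
    return (p >= 0 && p < nproc) ? p : -1;
}

/* Partner in the second exchange: the first and last ranks skip it,
 * even ranks pair with myid - 1, odd ranks with myid + 1. */
int phase2_partner(int myid, int nproc) {
    assert(nproc > 0 && myid >= 0 && myid < nproc);
    if (myid == 0 || myid == nproc - 1)
        return -1;
    return (myid % 2 == 0) ? myid - 1 : myid + 1;
}
```

With nproc = 4 this reproduces the four listings: rank 1 exchanges with rank 0 and then with rank 2, while ranks 0 and 3 take part in only one exchange.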

7 Basic MPI Terms
Rank
  ID number of a process
  From 0 to (number of processes) - 1
Communicator
  Group of communicating processes
  MPI_COMM_WORLD: the set of all processes
  Communicator size: number of processes
Buffer
  Memory area that contains / stores data
  Specified by pointers

8 Short and complete MPI code

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int myid, nproc;
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);
        int mydata = 2 * myid + 1;
        printf("i am %d, mydata = %d\n", myid, mydata);
        int recvdata;
        MPI_Status stat;
        if (myid == 1) {
            MPI_Send(&mydata, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else if (myid == 0) {
            MPI_Recv(&recvdata, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &stat);
            printf("sum = %d\n", mydata + recvdata);
        }
        MPI_Finalize();
        return 0;
    }

9-16 Short and complete MPI code (annotated)
Slides 9 to 16 repeat the code of slide 8, each highlighting one part:
  9 #include <mpi.h>: include the header file
  10 MPI_Init(&argc, &argv): initialize in this form
  11 MPI_Comm_rank, MPI_Comm_size: get myid and nproc
  12 int mydata = 2 * myid + 1; printf(...): make some data and print it out
  13 int recvdata; MPI_Status stat: data structures for the receive
  14 MPI_Send(...): send one data item
  15 MPI_Recv(...): receive one data item
  16 MPI_Finalize(): finalize; must be called

17 MPI-C at a glance

18 MPI example: cmpsum
Note: this function is provided by MPI_Allreduce
Butterfly algorithm
  Only applicable when the number of processes is a power of 2
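The cmpsum code itself is not preserved in this transcript. As an illustration of the butterfly idea only, here is a serial sketch that simulates all ranks inside one process: values[r] stands for the local data of rank r, and in each round every rank combines its value with that of the partner whose rank differs in one bit. The function name butterfly_allsum and the MAX_PROC limit are assumptions made for this sketch; a real MPI version would replace the array copy with a send/receive pair per round.

```c
#include <assert.h>

#define MAX_PROC 64  /* illustrative upper bound on simulated ranks */

/* Serial simulation of a butterfly all-reduce (sum).
 * On return, every values[r] holds the sum of all initial values. */
void butterfly_allsum(int *values, int nproc) {
    /* The pairing r ^ dist only covers all ranks for powers of 2. */
    assert(nproc > 0 && nproc <= MAX_PROC && (nproc & (nproc - 1)) == 0);
    int incoming[MAX_PROC];
    for (int dist = 1; dist < nproc; dist <<= 1) {
        /* "Receive" the partner's current partial sum ... */
        for (int r = 0; r < nproc; r++)
            incoming[r] = values[r ^ dist];
        /* ... then every rank adds it to its own partial sum. */
        for (int r = 0; r < nproc; r++)
            values[r] += incoming[r];
    }
}
```

The loop runs log2(nproc) rounds, and the assertion makes the power-of-2 restriction explicit: with any other nproc, r ^ dist can name a rank that does not exist.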

19 Message transfer protocols
Eager protocol
  Data is sent without waiting for the matching receive call
Rendezvous protocol
  Data is sent after the matching receive has been called
[Figure: src/dst diagrams of both protocols showing user and system buffers; in the rendezvous diagram the buffer address is passed first]

20 Message transfer protocols
Eager protocol
  Data is first sent to a system buffer, then copied to the user area
  No wait for the matching receive
  No deadlock
  Fast for small messages
  System-to-user data copy: slow for large messages
Rendezvous protocol
  The receive buffer address is sent first, then the data is transferred
  Must wait for the matching receive
  Deadlock may happen
  Slow for small messages
  Direct data transfer: fast for large messages

21 MPI_Isend & MPI_Irecv
Non-blocking communication: MPI_Isend and MPI_Irecv
  Return at once, without waiting for completion of the data transfer
  MPI_Wait must be called to complete the operation
  Warning: after MPI_Isend and before MPI_Wait, you must not modify the buffer
  Any combination, e.g. Send-Irecv or Isend-Recv, is OK

22 BREAK

23 MPI example: stencil
Heat equation (dissipation)
  \frac{\partial u}{\partial t} = \kappa \frac{\partial^2 u}{\partial x^2}
Finite difference approximation, with u_{i,k} \approx u(i \Delta x, k \Delta t)
  \frac{u_{i,k+1} - u_{i,k}}{\Delta t} = \kappa \frac{u_{i+1,k} - 2 u_{i,k} + u_{i-1,k}}{\Delta x^2}

    r = kappa * delta_t / (delta_x * delta_x);
    u[i][k+1] = r * u[i+1][k] + (1 - 2 * r) * u[i][k] + r * u[i-1][k];
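The update formula can be tried out serially before any MPI is involved. The sketch below is an illustration under stated assumptions: it keeps only two time levels (u_old and u_new) instead of the two-dimensional u[i][k] of the slide, the function name heat_step is hypothetical, and the two end points are simply copied (a fixed-value boundary, which the slide does not specify).

```c
#include <assert.h>

/* One explicit time step of the 1D heat equation.
 * Arrays have n + 2 entries: indices 1..n are interior points,
 * indices 0 and n + 1 are boundary points (assumed fixed here).
 * r = kappa * delta_t / (delta_x * delta_x), as on the slide. */
void heat_step(const double *u_old, double *u_new, int n, double r) {
    assert(n >= 1);
    u_new[0] = u_old[0];          /* boundary values carried over */
    u_new[n + 1] = u_old[n + 1];
    for (int i = 1; i <= n; i++) {
        u_new[i] = r * u_old[i + 1]
                 + (1.0 - 2.0 * r) * u_old[i]
                 + r * u_old[i - 1];
    }
}
```

Each new value depends only on the point itself and its two neighbors at the previous time level, which is what makes the pattern a stencil. This explicit scheme is stable only for r <= 1/2, so delta_t has to be chosen accordingly.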

24-26 MPI example: stencil
Global index axis: 0, 1, ..., b, b+1, ..., 2b, 2b+1, ..., n, n+1 (n = 3b)
[Figure, three builds: the allocated memory for rank 0, rank 1, and rank 2 is highlighted in turn]
Per rank:
  Compute: 3 elements
  Allocation: 5 elements
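One reading of the figure is that each rank computes a contiguous block of b indices and allocates one extra element on each side. The sketch below encodes that reading; the struct block_range, the function block_of, and the requirement that n be divisible by nproc are assumptions made for illustration, not taken from the slides.

```c
#include <assert.h>

/* Index ranges (global numbering, inclusive) for one rank. */
typedef struct {
    int first;        /* first index the rank computes         */
    int last;         /* last index the rank computes          */
    int alloc_first;  /* first allocated index (one cell left) */
    int alloc_last;   /* last allocated index (one cell right) */
} block_range;

/* Block decomposition of interior indices 1..n over nproc ranks,
 * assuming n is a multiple of nproc (b = n / nproc per rank). */
block_range block_of(int rank, int nproc, int n) {
    assert(nproc > 0 && n % nproc == 0 && rank >= 0 && rank < nproc);
    int b = n / nproc;
    block_range r;
    r.first = rank * b + 1;
    r.last = (rank + 1) * b;
    r.alloc_first = r.first - 1;
    r.alloc_last = r.last + 1;
    return r;
}
```

For n = 9 and three ranks, rank 1 computes indices 4 to 6 (b+1 to 2b) and allocates indices 3 to 7 (b to 2b+1), which gives the 3 computed and 5 allocated elements stated on the slides.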

27 Shadow / Halo
Extra array region for incoming messages
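To show what the halo cells hold, the following serial sketch simulates the exchange for several ranks inside one process. It is an illustration under assumptions: the name exchange_halos and the fixed block size B are hypothetical, and each array copy stands in for what would be a send/receive pair between neighboring ranks in an MPI program.

```c
#include <assert.h>

#define B 3  /* cells computed per rank (illustrative value) */

/* local[r] is the array of simulated rank r:
 *   local[r][1..B]  cells owned (computed) by rank r
 *   local[r][0]     left halo, copy of the left neighbor's last cell
 *   local[r][B+1]   right halo, copy of the right neighbor's first cell
 * Only halo cells are written and only owned cells are read, so the
 * order in which the ranks are processed does not matter. */
void exchange_halos(double local[][B + 2], int nproc) {
    assert(nproc >= 1);
    for (int r = 0; r < nproc; r++) {
        if (r > 0)
            local[r][0] = local[r - 1][B];      /* from the left neighbor  */
        if (r < nproc - 1)
            local[r][B + 1] = local[r + 1][1];  /* from the right neighbor */
    }
}
```

After the exchange every rank can apply the stencil to all of its owned cells using local data only; the outermost halos of the first and last rank are left untouched and can serve as the physical boundary values.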

28 Order of messages
Messages are non-overtaking
  However, there are reports that overtaking happens on FX10
  Solvable by forcing the matching with tags
No fairness is guaranteed

