Introduction to Parallel Programming for Multicore/Manycore Clusters Part II-3: Parallel FVM using MPI


1 Introduction to Parallel Programming for Multicore/Manycore Clusters Part II-3: Parallel FVM using MPI Kengo Nakajima Information Technology Center The University of Tokyo

2 Overview Introduction Local Data Structure & Communication (1D, 2D)

3 Goal of the Last Part FVM Code with OpenMP/MPI Hybrid Parallel Programming Model, based on the Initial Code (L-sol on the first day), with Diagonal/Point-Jacobi Preconditioning (METHOD=). OpenMP: straightforward, no data dependency, just insert OpenMP directives. MPI: distributed computation, special data structure required. Flat MPI: each core runs an independent MPI process; intra/inter-node communication by MPI only. Hybrid: hierarchical structure, OpenMP within each node/socket and MPI between them. (Figure: memory/core diagrams of the flat-MPI and hybrid configurations.)

4 Some Technical Terms Processor, Core: processing unit (H/W); Processor = Core for single-core processors. Process: unit of MPI computation, nearly equal to a core. Each core (or processor) can host multiple processes (but this is not efficient). PE (Processing Element): PE originally means processor, but it is sometimes used as process in this class; moreover, it can mean domain (next). In multicore processors, PE generally means core. Domain: domain = process (= PE), each of the MD in SPMD, each data set. Process ID of MPI (ID of PE, ID of domain) starts from 0: if you have 8 processes (PEs, domains), IDs are 0~7.

5 5 Parallel Computing on Distributed Memory Architecture Faster, Larger & More Complicated. Scalability: solving an N-times larger problem using N times the computational resources in the same computation time, for large-scale problems: Weak Scaling (e.g. CG solver: more iterations are needed for larger problems). Solving a fixed-size problem using N times the computational resources in 1/N the computation time, for faster computation: Strong Scaling. (Figure: nodes with CPUs and multiple cores connected by a network.)
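As a brief aside (these formulas are not on the slide), the two notions of scalability can be written compactly; here T(P, W) denotes the elapsed time for total problem size W on P processes:

S_{\mathrm{strong}}(P) = \frac{T(1,\,W)}{T(P,\,W)} \quad (\text{ideal: } S = P), \qquad
E_{\mathrm{weak}}(P) = \frac{T(1,\,W)}{T(P,\,P\,W)} \quad (\text{ideal: } E = 1)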

6 6 What is Parallel Computing? (1/2) To solve larger problems faster. Homogeneous/Heterogeneous Porous Media (Lawrence Livermore National Laboratory). Very fine meshes are required for simulations of heterogeneous fields.

7 7 What is Parallel Computing? (2/2) A PC with GB of memory: M meshes are the limit for FEM. Southwest Japan (,000 km x ,000 km x 00 km) with a km mesh -> meshes. Large Data -> Domain Decomposition -> Local Operation; Inter-Domain Communication for Global Operation. (Figure: large-scale data partitioned into local data sets connected by communication.)

8 PE: Processing Element (Processor, Domain, Process) SPMD: mpirun -np M <Program>. You understand 90% of MPI if you understand this figure. PE #0, PE #1, PE #2, ..., PE #M-1 each run the same Program on Data #0, Data #1, Data #2, ..., Data #M-1. Each process does the same operation on different data; large-scale data is decomposed, and each part is computed by a separate process. Ideally, the parallel program differs from the serial one only in communication.

9 9 What is Communication? Parallel Computing -> Local Operations Communications are required in Global Operations for Consistency.

10 Local Data Structures for Parallel FVM/FDM using Krylov Iterative Solvers Example: 2D Mesh (5-point stencil)

11 Example: 2D FDM Mesh (5-point stencil): regions/domains

12 Example: 2D Mesh (5-point stencil): regions/domains

13 Example: 2D Mesh (5-point stencil): meshes at the domain boundary need info. from neighboring domains

14 Example: 2D Mesh (5-point stencil): meshes at the domain boundary need info. from neighboring domains

15 Example: 2D Mesh (5-point stencil): communications using HALO (overlapped meshes)

16 Coefficient matrices can be locally generated on each partition by this data structure

17 Operations in Parallel FVM SPMD: Single-Program Multiple-Data. Large-scale data -> partitioned into distributed local data sets. The FVM code assembles the coefficient matrix for each local data set: this part is completely local, the same as the serial operations. Global operations & communications happen only in the linear solvers: dot products, matrix-vector multiply, preconditioning. (Figure: Local Data -> FVM Matrix -> Linear Solver on each PE, with MPI connecting the linear solvers; a sketch of this structure follows below.)
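To make the SPMD structure above concrete, here is a minimal sketch (not the course code) of how such a parallel FVM program is typically organized; read_local_data, assemble_local_matrix, and solve_cg_parallel are hypothetical placeholder names for the steps described on this slide:

#include <mpi.h>
#include <stdio.h>

/* hypothetical placeholders for the steps named in the slide */
static void read_local_data(int rank)   { (void)rank; /* read sqm.<rank>, sq.<rank>, ... */ }
static void assemble_local_matrix(void) { /* purely local, same as the serial FVM code  */ }
static void solve_cg_parallel(void)     { /* dot products, mat-vec, halo exchange here  */ }

int main(int argc, char **argv)
{
  int my_rank, n_proc;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &n_proc);

  read_local_data(my_rank);    /* each process reads its own distributed local data set */
  assemble_local_matrix();     /* completely local operation                            */
  solve_cg_parallel();         /* global operations & communication live only here      */

  if (my_rank == 0) printf("ran with %d processes\n", n_proc);
  MPI_Finalize();
  return 0;
}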

18 PE: Processing Element (Processor, Domain, Process) SPMD: mpirun -np M <Program>. PE #0, PE #1, PE #2, ..., PE #M-1 each run the same Program on Data #0, Data #1, Data #2, ..., Data #M-1. Each process does the same operation on different data; large-scale data is decomposed, and each part is computed by a separate process. Ideally, the parallel program differs from the serial one only in communication.

19 Parallel FVM Procedures Design of the local data structure is important for the SPMD-type operations shown on the previous page: matrix generation, preconditioned iterative solvers for linear equations.

20 Example: D Mesh (5-point stencil) communications using HALO (overlapped meshes) 0

21 Internal / External / Boundary Nodes Internal Nodes/Meshes: Originally assigned to the process (domain)

22 Internal / External / Boundary Nodes External Nodes/Meshes: Originally assigned to other processes (domains), but referenced by the process: HALO

23 Internal / External / Boundary Nodes Boundary Nodes/Meshes: Internal nodes referred by other processes (domains) as external nodes

24 What is Communication? Getting information of external nodes from external partitions (local data). In this study, Generalized Communication Tables contain this information.

25 Introduction Quick Overview of MPI Local Data Structure & Communication (1D, 2D)

26 1D FVM: meshes / domains

27 1D FVM: meshes / domains

28 # Internal Nodes should be balanced

29 Matrices are incomplete! (figure: local matrices on the three domains, with entries missing at the domain boundaries)

30 Connected Cells + External Nodes (figure: local matrices completed by the columns for external nodes)

31 1D FVM: meshes / domains (figure: three domains with internal and external nodes)

32 1D FVM: meshes / domains

33 Local Numbering for SPMD Numbering of internal nodes is 1-N (0 to N-1 in C), so the same operations as in the serial program can be applied. How about the numbering of external nodes?

34 PE: Processing Element (Processor, Domain, Process) SPMD: mpirun -np M <Program>. PE #0, PE #1, PE #2, ..., PE #M-1 each run the same Program on Data #0, Data #1, Data #2, ..., Data #M-1. Each process does the same operation on different data; large-scale data is decomposed, and each part is computed by a separate process. Ideally, the parallel program differs from the serial one only in communication.

35 Local Numbering for SPMD Numbering of external nodes: N+1, N+2, ... (N, N+1, ... in C)

36 Preconditioned CG Solver
Compute r(0) = b - [A]x(0)
for i = 1, 2, ...
  solve [M]z(i-1) = r(i-1)
  rho(i-1) = r(i-1) . z(i-1)
  if i = 1
    p(1) = z(0)
  else
    beta(i-1) = rho(i-1) / rho(i-2)
    p(i) = z(i-1) + beta(i-1) p(i-1)
  endif
  q(i) = [A]p(i)
  alpha(i) = rho(i-1) / (p(i) . q(i))
  x(i) = x(i-1) + alpha(i) p(i)
  r(i) = r(i-1) - alpha(i) q(i)
  check convergence |r|
end
(Figure: [M] is the diagonal matrix diag(D_1, ..., D_N), i.e. diagonal/point-Jacobi preconditioning.)

37 7 Preconditioning, DAXPY Local Operations by Only Internal Points: Parallel Processing is possible 0 /* //-- {z}= [Minv]{r} */ for(i=0;i<n;i++){ W[Z][i] = W[DD][i] * W[R][i]; } /* //-- {x}= {x} + ALPHA*{p} DAXPY: double a{x} plus {y} // {r}= {r} - ALPHA*{q} */ for(i=0;i<n;i++){ U[i] += Alpha * W[P][i]; W[R][i] -= Alpha * W[Q][i]; }

38 Dot Products Global Summation needed: Communication? 0 /* //-- ALPHA= RHO / {p}{q} */ C = 0.0; for(i=0;i<n;i++){ C += W[P][i] * W[Q][i]; } Alpha = Rho / C;

39 MPI_Reduce (figure: values A, B, C, D on P#0-P#3 are reduced to op.A0-A3, op.B0-B3, op.C0-C3, op.D0-D3 on the root process) Reduces values on all processes to a single value: Summation, Product, Max, Min etc. MPI_Reduce (sendbuf,recvbuf,count,datatype,op,root,comm) sendbuf choice I starting address of send buffer recvbuf choice O starting address of receive buffer, type is defined by datatype count int I number of elements in send/receive buffer datatype MPI_Datatype I data type of elements of send/receive buffer FORTRAN: MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION, MPI_CHARACTER etc. C: MPI_INT, MPI_FLOAT, MPI_DOUBLE, MPI_CHAR etc. op MPI_Op I reduce operation: MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, MPI_LAND, MPI_BAND etc. Users can define operations by MPI_OP_CREATE root int I rank of root process comm MPI_Comm I communicator C

40 0 MPI_Bcast P#0 A0 B0 C0 D0 P# Broadcast P#0 A0 B0 C0 D0 P# A0 B0 C0 D0 P# P# A0 B0 C0 D0 P# P# A0 B0 C0 D0 Broadcasts a message from the process with rank "root" to all other processes of the communicator MPI_Bcast (buffer,count,datatype,root,comm) buffer choice I/O starting address of buffer type is defined by datatype count int I number of elements in send/recv buffer datatype MPI_Datatype I data type of elements of send/recv buffer FORTRAN MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION, MPI_CHARACTER etc. C MPI_INT, MPI_FLOAT, MPI_DOUBLE, MPI_CHAR etc. root int I rank of root process comm MPI_Comm I communicator C
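As a small illustration (not part of the lecture code), broadcasting a solver tolerance read on rank 0 to all other processes might look like this; the variable eps and its value are hypothetical examples:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int my_rank;
  double eps = 0.0;                 /* convergence tolerance, example value        */

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

  if (my_rank == 0) eps = 1.0e-8;   /* only the root knows the value initially     */

  /* after this call every process in the communicator holds eps = 1.0e-8          */
  MPI_Bcast(&eps, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  printf("rank %d: eps = %e\n", my_rank, eps);
  MPI_Finalize();
  return 0;
}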

41 MPI_Allreduce (figure: every process receives the reduced values op.A0-A3, op.B0-B3, op.C0-C3, op.D0-D3) MPI_Reduce + MPI_Bcast. Summation (of dot products) and MAX/MIN values are likely to be utilized in each process. MPI_Allreduce (sendbuf,recvbuf,count,datatype,op,comm) sendbuf choice I starting address of send buffer recvbuf choice O starting address of receive buffer, type is defined by datatype count int I number of elements in send/recv buffer datatype MPI_Datatype I data type of elements of send/recv buffer op MPI_Op I reduce operation comm MPI_Comm I communicator C
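For the dot product shown two slides back, a minimal sketch of the parallel version is the following fragment of the CG loop (it reuses the lecture's variable names W, N, Rho, Alpha, C; the local partial sum C0 is introduced only for this sketch):

/* local partial dot product over internal points only */
C0 = 0.0;
for (i = 0; i < N; i++) {
  C0 += W[P][i] * W[Q][i];
}

/* global summation over all processes: every process receives the same C */
MPI_Allreduce(&C0, &C, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

Alpha = Rho / C;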

42 op of MPI_Reduce/Allreduce C MPI_Reduce (sendbuf,recvbuf,count,datatype,op,root,comm) MPI_MAX,MPI_MIN Max, Min MPI_SUM,MPI_PROD Summation, Product MPI_LAND Logical AND

43 Matrix-Vector Products Values at External Points: P-to-P Communication /* //-- {q}= [A]{p} */ for(i=0;i<n;i++){ W[Q][i] = Diag[i] * W[P][i]; for(j=index[i];j<index[i+1];j++){ W[Q][i] += AMat[j]*W[P][Item[j]-1]; } }
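Before this loop can run, the values of {p} at external (halo) points must be up to date. A hedged sketch of the calling order, where update_external_points() is a hypothetical name standing for the Isend/Irecv/Waitall exchange described later in this part, is:

/* 1. refresh the halo: boundary values of {p} are copied to sending buffers,
      sent to the neighbors, and the received values are stored at the
      external points (local IDs N .. NP-1)                                    */
update_external_points(W[P]);   /* hypothetical name for the exchange routine  */

/* 2. purely local matrix-vector product, identical to the serial loop         */
for (i = 0; i < n; i++) {
  W[Q][i] = Diag[i] * W[P][i];
  for (j = index[i]; j < index[i+1]; j++) {
    W[Q][i] += AMat[j] * W[P][Item[j]-1];
  }
}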

44 Mat-Vec Products: Local Op. Possible (matrix-vector diagram)

45 Mat-Vec Products: Local Op. Possible (matrix-vector diagram, partitioned by domain)

46 Mat-Vec Products: Local Op. Possible (matrix-vector diagram, partitioned by domain)

47 Mat-Vec Products: Local Op. (matrix-vector diagram for one domain)

48 Mat-Vec Products: Local Op. (matrix-vector diagram for one domain)

49 Mat-Vec Products: Local Op. (matrix-vector diagram for one domain)

50 1D FVM: meshes / domains

51 1D FVM: meshes / domains. Local ID: starting from 0 for the meshes in each domain (#0, #1, #2).

52 1D FVM: meshes / domains. Internal/External Nodes on domains #0, #1, #2.

53 Collective/Point-to-Point Communication Collective Communication: MPI_Reduce, MPI_Scatter/Gather etc.; communications with all processes in the communicator. Application area: BEM, spectral method, MD, where global interactions are considered; dot products, MAX/MIN: global summation & comparison. Point-to-Point Communication: MPI_Send, MPI_Recv; communication with a limited number of processes (neighbors). Application area: FEM, FDM (localized methods).

54 SEND: sending from boundary nodes Send continuous data to the send buffer of each neighbor MPI_Isend (sendbuf,count,datatype,dest,tag,comm,request) sendbuf choice I starting address of sending buffer count int I number of elements sent to each process datatype MPI_Datatype I data type of elements of sending buffer dest int I rank of destination

55 MPI_Isend C 55 Begins a non-blocking send Send the contents of sending buffer (starting from sendbuf, number of messages: count) to dest with tag. Contents of sending buffer cannot be modified before calling corresponding MPI_Waitall. MPI_Isend (sendbuf,count,datatype,dest,tag,comm,request) sendbuf choice I starting address of sending buffer count int I number of elements in sending buffer datatype MPI_Datatype I datatype of each sending buffer element dest int I rank of destination tag int I message tag This integer can be used by the application to distinguish messages. Communication occurs if tag s of MPI_Isend and MPI_Irecv are matched. Usually tag is set to be 0 (in this class), comm MPI_Comm I communicator request MPI_Request O communication request array used in MPI_Waitall

56 RECV: receiving to external nodes Receive continuous data into the receive buffer from each neighbor MPI_Irecv (recvbuf,count,datatype,source,tag,comm,request) recvbuf choice I starting address of receiving buffer count int I number of elements in receiving buffer datatype MPI_Datatype I data type of elements of receiving buffer source int I rank of source

57 MPI_Irecv C Begins a non-blocking receive Receiving the contents of receiving buffer (starting from recvbuf, number of messages: count) from source with tag. Contents of receiving buffer cannot be used before calling corresponding MPI_Waitall. MPI_Irecv (recvbuf,count,datatype,source,tag,comm,request) recvbuf choice I starting address of receiving buffer count int I number of elements in receiving buffer datatype MPI_Datatype I datatype of each receiving buffer element source int I rank of source tag int I message tag This integer can be used by the application to distinguish messages. Communication occurs if tag s of MPI_Isend and MPI_Irecv are matched. Usually tag is set to be 0 (in this class), comm MPI_Comm I communicator request MPI_Request O communication request array used in MPI_Waitall 57

58 MPI_Waitall C MPI_Waitall blocks until all communications associated with the requests in the array complete. It is used for synchronizing MPI_Isend and MPI_Irecv in this class. In the sending phase, the contents of the sending buffer cannot be modified before calling the corresponding MPI_Waitall. In the receiving phase, the contents of the receiving buffer cannot be used before calling the corresponding MPI_Waitall. MPI_Isend and MPI_Irecv can be synchronized simultaneously with a single MPI_Waitall if it is consistent; the same request array should then be used for MPI_Isend and MPI_Irecv. Its operation is similar to that of MPI_Barrier, but MPI_Waitall cannot be replaced by MPI_Barrier. Possible troubles when using MPI_Barrier instead of MPI_Waitall: contents of request and status are not updated properly, very slow operations, etc. MPI_Waitall (count,request,status) count int I number of requests to be synchronized request MPI_Request I/O comm. requests used in MPI_Waitall (array size: count) status MPI_Status O array of status objects MPI_STATUS_SIZE: defined in mpif.h, mpi.h
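A minimal self-contained illustration of the Isend/Irecv/Waitall pattern between two processes (not the lecture code, which uses the generalized communication tables shown later) could look like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int my_rank, other, val_send, val_recv;
  MPI_Request req[2];
  MPI_Status  stat[2];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);   /* run with exactly 2 processes */
  other    = 1 - my_rank;                    /* the other rank               */
  val_send = 100 + my_rank;                  /* example payload              */

  MPI_Isend(&val_send, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &req[0]);
  MPI_Irecv(&val_recv, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &req[1]);

  /* neither buffer may be touched until both requests have completed */
  MPI_Waitall(2, req, stat);

  printf("rank %d received %d\n", my_rank, val_recv);
  MPI_Finalize();
  return 0;
}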

59 Distributed Local Data Structure for Parallel Computation A distributed local data structure for domain-to-domain communication has been introduced, which is appropriate for applications with sparse coefficient matrices (e.g. FDM, FEM, FVM). SPMD Local numbering: internal pts first, then external pts Generalized communication table Everything is easy if the proper data structure is defined: values at boundary pts are copied into sending buffers, Send/Recv, values at external pts are updated through receiving buffers.

60 Introduction Quick Overview of MPI Local Data Structure & Communication (1D, 2D)

61 2D FDM (1/5) Entire Mesh

62 2D FDM (5-point, central difference): (φ_W − 2φ_C + φ_E)/Δx² + (φ_S − 2φ_C + φ_N)/Δy² = f_C

63 Decompose into 4 domains

64 4 domains: Global ID (figure: PE#0, PE#1, PE#2, PE#3)

65 4 domains: Local ID (figure: PE#0, PE#1, PE#2, PE#3)

66 External Points: Overlapped Region (figure: 5-point stencil N/S/E/W/C on the four domains)

67 External Points: Overlapped Region (figure: PE#0, PE#1, PE#2, PE#3)

68 Local ID of External Points? (figure: the four domains with external points marked "?")

69 Overlapped Region (figure: the four domains with external points marked "?")

70 Overlapped Region (figure: the four domains with external points marked "?")

71 Problem Setting: 2D FDM 2D region with 64 meshes (8x8). Each mesh has a global ID from 1 to 64. In this example, this global ID is considered as the dependent variable, such as temperature, pressure etc.: something like computed results.

72 Problem Setting: Distributed Local Data 4 sub-domains. Info. of external points (global ID of mesh) is received from neighbors. (Figure: PE#0 receives the halo values from its neighbors.)

73 Operations of 2D FDM: (φ_W − 2φ_C + φ_E)/Δx² + (φ_S − 2φ_C + φ_N)/Δy² = f_C

74 Operations of 2D FDM: (φ_W − 2φ_C + φ_E)/Δx² + (φ_S − 2φ_C + φ_N)/Δy² = f_C (5-point stencil: N, S, E, W around C)

75 Computation (/) On each PE, info. of internal pts (i=1-N(=16)) is read from the distributed local data, info. of boundary pts is sent to neighbors, and it is received as info. of external pts.

76 Computation (/): Before Send/Recv (table: local ID : value on each of the four PEs; internal points 1-16 hold their global IDs, external points 17-24 are still "?")

77 Computation (/): Before Send/Recv (same table, repeated)

78 Computation (/): After Send/Recv (table: local ID : value on each of the four PEs; external points 17-24 have now been filled by communication)

79 Overview of Distributed Local Data: Example on PE#0 (figures: value at each mesh (= global ID) and local ID)

80 SPMD PE #0, PE #1, PE #2, PE #3 each run a.out. Distributed local data sets sqm.0, sqm.1, sqm.2, sqm.3: neighbors, comm. tables (geometry). Distributed local data sets sq.0, sq.1, sq.2, sq.3: global ID of internal points (results).

81 2D FDM: PE#0, Information at each domain (1/4) Internal Nodes/Points/Meshes: meshes originally assigned to the domain.

82 2D FDM: PE#0, Information at each domain (2/4) Internal Nodes/Points/Meshes: meshes originally assigned to the domain. External Nodes/Points/Meshes: meshes originally assigned to a different domain, but required for the computation of meshes in this domain (meshes in the overlapped regions); also called sleeves or halo.

83 2D FDM: PE#0, Information at each domain (3/4) Internal Nodes/Points/Meshes: meshes originally assigned to the domain. External Nodes/Points/Meshes: meshes originally assigned to a different domain, but required for the computation of meshes in this domain (meshes in the overlapped regions). Boundary Nodes/Points/Meshes: internal points which are also external points of other domains (used in the computation of meshes in other domains).

84 2D FDM: PE#0, Information at each domain (4/4) Internal Nodes/Points/Meshes: meshes originally assigned to the domain. External Nodes/Points/Meshes: meshes originally assigned to a different domain, but required for the computation of meshes in this domain (meshes in the overlapped regions). Boundary Nodes/Points/Meshes: internal points which are also external points of other domains (used in the computation of meshes in other domains). Relationships between Domains: communication table, external/boundary points, neighbors.

85 5 Description of Distributed Local Data Internal/External Nodes Numbering: Starting from internal nodes, then external nodes after that Neighbors Shares overlapped meshes Number and ID of neighbors Import Table (Receive) From where, how many, and which external nodes are received/imported? Export Table (Send) To where, how many and which boundary nodes are sent/exported?
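Collecting the items above into one place, a C struct along the following lines could hold the distributed local data; this is only an illustrative sketch using the array names that appear in the lecture code, not a definition taken from it:

typedef struct {
  /* internal/external nodes: 0..N-1 internal, N..NP-1 external    */
  int N;               /* number of internal nodes                 */
  int NP;              /* total number of nodes (internal + ext.)  */

  /* neighbors sharing overlapped meshes                           */
  int  NeibPETot;      /* number of neighbors                      */
  int *NeibPE;         /* ranks of the neighbors [NeibPETot]       */

  /* import table (receive): which external nodes, from where      */
  int *ImportIndex;    /* size NeibPETot+1, offsets per neighbor   */
  int *ImportItem;     /* local IDs of external nodes              */

  /* export table (send): which boundary nodes, to where           */
  int *ExportIndex;    /* size NeibPETot+1, offsets per neighbor   */
  int *ExportItem;     /* local IDs of boundary nodes              */
} LocalMesh;           /* hypothetical name, used only in this sketch */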

86 Overview of Distributed Local Data: Example on PE#0 (figures: value at each mesh (= global ID) and local ID)

87 Generalized Communication Table: Send Neighbors: NeibPETot, NeibPE[neib] Message size for each neighbor: export_index[neib], neib= 0, NeibPETot-1 ID of boundary nodes: export_item[k], k= 0, export_index[NeibPETot]-1 Messages to each neighbor: SendBuf[k], k= 0, export_index[NeibPETot]-1 C

88 SEND: MPI_Isend/Irecv/Waitall C
SendBuf layout: neib#0 | neib#1 | neib#2 | neib#3, each block of size BUFlength_e, delimited by export_index[0], export_index[1], export_index[2], export_index[3], export_index[4].
export_item (export_index[neib] : export_index[neib+1]-1) are sent to the neib-th neighbor.

/* boundary-node values are copied to the sending buffer */
for (neib=0; neib<NeibPETot; neib++){
  for (k=export_index[neib]; k<export_index[neib+1]; k++){
    kk= export_item[k];
    SendBuf[k]= VAL[kk];
  }
}

for (neib=0; neib<NeibPETot; neib++){
  tag= 0;
  iS_e= export_index[neib];
  iE_e= export_index[neib+1];
  BUFlength_e= iE_e - iS_e;

  ierr= MPI_Isend (&SendBuf[iS_e], BUFlength_e, MPI_DOUBLE,
                   NeibPE[neib], 0, MPI_COMM_WORLD, &ReqSend[neib]);
}
MPI_Waitall(NeibPETot, ReqSend, StatSend);

89 Generalized Communication Table: Receive Neighbors: NeibPETot, NeibPE[neib] Message size for each neighbor: import_index[neib], neib= 0, NeibPETot-1 ID of external nodes: import_item[k], k= 0, import_index[NeibPETot]-1 Messages from each neighbor: RecvBuf[k], k= 0, import_index[NeibPETot]-1 C

90 RECV: MPI_Isend/Irecv/Waitall C

for (neib=0; neib<NeibPETot; neib++){
  tag= 0;
  iS_i= import_index[neib];
  iE_i= import_index[neib+1];
  BUFlength_i= iE_i - iS_i;

  ierr= MPI_Irecv (&RecvBuf[iS_i], BUFlength_i, MPI_DOUBLE,
                   NeibPE[neib], 0, MPI_COMM_WORLD, &ReqRecv[neib]);
}
MPI_Waitall(NeibPETot, ReqRecv, StatRecv);

/* received values are copied from the receiving buffer to the external points */
for (neib=0; neib<NeibPETot; neib++){
  for (k=import_index[neib]; k<import_index[neib+1]; k++){
    kk= import_item[k];
    VAL[kk]= RecvBuf[k];
  }
}

import_item (import_index[neib] : import_index[neib+1]-1) are received from the neib-th neighbor.
RecvBuf layout: neib#0 | neib#1 | neib#2 | neib#3, each block of size BUFlength_i, delimited by import_index[0], import_index[1], import_index[2], import_index[3], import_index[4].

91 Relationship SEND/RECV

do neib= 1, NEIBPETOT
  iS_e= export_index(neib-1) + 1
  iE_e= export_index(neib  )
  BUFlength_e= iE_e + 1 - iS_e
  call MPI_ISEND &
&      (SENDbuf(iS_e), BUFlength_e, MPI_INTEGER, NEIBPE(neib), 0,&
&       MPI_COMM_WORLD, request_send(neib), ierr)
enddo

do neib= 1, NEIBPETOT
  iS_i= import_index(neib-1) + 1
  iE_i= import_index(neib  )
  BUFlength_i= iE_i + 1 - iS_i
  call MPI_IRECV &
&      (RECVbuf(iS_i), BUFlength_i, MPI_INTEGER, NEIBPE(neib), 0,&
&       MPI_COMM_WORLD, request_recv(neib), ierr)
enddo

Consistency of IDs of sources/destinations, size and contents of messages! Communication occurs when NEIBPE(neib) matches.

92 9 Relationship SEND/RECV (#0 to #) # # # Send #0 Recv. # #5 #0 #9 NEIBPE(:)=,,5,9 #0 NEIBPE(:)=,0,0 Consistency of ID s of sources/destinations, size and contents of messages! Communication occurs when NEIBPE(neib) matches

93 9 Generalized Comm. Table (/6) PE# PE# #NEIBPEtot #NEIBPE #NODE 6 #IMPORT_index #IMPORT_items #EXPORT_index #EXPORT_items 6 5 6

94 9 Generalized Comm. Table (/6) PE# PE# #NEIBPEtot Number of neighbors #NEIBPE ID of neighbors #NODE 6 Ext/Int Pts, Int Pts #IMPORT_index #IMPORT_items #EXPORT_index #EXPORT_items 6 5 6

95 Generalized Comm. Table (3/6) #NEIBPEtot #NEIBPE #NODE 6 #IMPORT_index #IMPORT_items #EXPORT_index #EXPORT_items Four ext pts (1st-4th items) are imported from the 1st neighbor (PE#), and four (5th-8th items) are from the 2nd neighbor (PE#).

96 Generalized Comm. Table (4/6) #NEIBPEtot #NEIBPE #NODE 6 #IMPORT_index #IMPORT_items #EXPORT_index #EXPORT_items imported from the 1st Neighbor (PE#) (1st-4th items); imported from the 2nd Neighbor (PE#) (5th-8th items)

97 Generalized Comm. Table (5/6) #NEIBPEtot #NEIBPE #NODE 6 #IMPORT_index #IMPORT_items Four boundary pts (1st-4th items) are exported to the 1st neighbor (PE#), and four (5th-8th items) are to the 2nd neighbor (PE#). #EXPORT_index #EXPORT_items 6 5 6

98 Generalized Comm. Table (6/6) #NEIBPEtot #NEIBPE #NODE 6 #IMPORT_index #IMPORT_items #EXPORT_index #EXPORT_items exported to the 1st Neighbor (PE#) (1st-4th items); exported to the 2nd Neighbor (PE#) (5th-8th items)

99 Generalized Comm. Table (6/6) An external point is only sent from its original domain. A boundary point can be referenced by more than one domain, and sent to multiple domains (e.g. the 6th mesh).

100 Notice: Send/Recv Arrays #PE0 send: VEC(start_send)~VEC(start_send+length_send-1) #PE1 send: VEC(start_send)~VEC(start_send+length_send-1) #PE0 recv: VEC(start_recv)~VEC(start_recv+length_recv-1) #PE1 recv: VEC(start_recv)~VEC(start_recv+length_recv-1) length_send of the sending process must be equal to length_recv of the receiving process (PE#0 to PE#1, PE#1 to PE#0). sendbuf and recvbuf: different addresses.

101 Copying files / 2D FDM on Oakleaf-FX >$ cd >$ cp /home/z00/omp/hybrid-c.tar . >$ cp /home/z00/omp/hybrid-f.tar . >$ tar xvf hybrid-c.tar (or hybrid-f.tar) >$ cd hybrid >$ ls S fvm (<$O-S>, <$O-fvm>) $ cd <$O-S> $ mpifrtpx -Kfast sq-sr.f $ mpifccpx -Kfast sq-sr.c (modify go.sh for the number of processes) $ pjsub go.sh

102 Job Script for FX10: go.sh <$O-S>/go.sh Scheduling + Shell Script #!/bin/sh #PJM -L node= (Number of Nodes) #PJM -L elapse=00:0:00 (Computation Time) #PJM -L rscgrp=lecture7 (Name of QUEUE) #PJM -g gt7 (Group Name, Wallet) #PJM -j #PJM -o test.lst (Standard Output) #PJM --mpi proc= (Number of MPI Processes) mpiexec ./a.out (table: number of nodes vs. number of MPI processes for various process counts)

103 Example: sq-sr.c (/6) Initialization #include <stdio.h> #include <stdlib.h> #include <string.h> #include <assert.h> #include "mpi.h" int main(int argc, char **argv){ C 0 int n, np, NeibPeTot, BufLength; MPI_Status *StatSend, *StatRecv; MPI_Request *RequestSend, *RequestRecv; int MyRank, PeTot; int *val, *SendBuf, *RecvBuf, *NeibPe; int *ImportIndex, *ExportIndex, *ImportItem, *ExportItem; char FileName[0], line[0]; int i, nn, neib; int istart, iend; FILE *fp; /*!C !C INIT. MPI!C !C===*/ MPI_Init(&argc, &argv); MPI_Comm_size(MPI_COMM_WORLD, &PeTot); MPI_Comm_rank(MPI_COMM_WORLD, &MyRank);

104 Example: sq-sr.c (2/6) Reading distributed local data files (sqm.*) C

/*!C
!C-- DATA file
!C===*/
sprintf(FileName, "sqm.%d", MyRank);
fp = fopen(FileName, "r");

fscanf(fp, "%d", &NeibPeTot);
NeibPe      = calloc(NeibPeTot,   sizeof(int));
ImportIndex = calloc(1+NeibPeTot, sizeof(int));
ExportIndex = calloc(1+NeibPeTot, sizeof(int));

for(neib=0;neib<NeibPeTot;neib++){
  fscanf(fp, "%d", &NeibPe[neib]); }

fscanf(fp, "%d %d", &np, &n);

for(neib=1;neib<NeibPeTot+1;neib++){
  fscanf(fp, "%d", &ImportIndex[neib]);}
nn = ImportIndex[NeibPeTot];
ImportItem = malloc(nn * sizeof(int));
for(i=0;i<nn;i++){
  fscanf(fp, "%d", &ImportItem[i]); ImportItem[i]--;}

for(neib=1;neib<NeibPeTot+1;neib++){
  fscanf(fp, "%d", &ExportIndex[neib]);}
nn = ExportIndex[NeibPeTot];
ExportItem = malloc(nn * sizeof(int));
for(i=0;i<nn;i++){
  fscanf(fp, "%d", &ExportItem[i]); ExportItem[i]--;}

105 Example: sq-sr.c (/6) Reading distributed local data files (sqm.*) /*!C !C DATA file!c !C===*/ sprintf(filename, "sqm.%d", MyRank); fp = fopen(filename, "r"); fscanf(fp, "%d", &NeibPeTot); NeibPe = calloc(neibpetot, sizeof(int)); ImportIndex = calloc(+neibpetot, sizeof(int)); ExportIndex = calloc(+neibpetot, sizeof(int)); for(neib=0;neib<neibpetot;neib++){ fscanf(fp, "%d", &NeibPe[neib]); } fscanf(fp, "%d %d", &np, &n); for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ImportIndex[neib]);} nn = ImportIndex[NeibPeTot]; ImportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ImportItem[i]); ImportItem[i]--;} for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ExportIndex[neib]);} nn = ExportIndex[NeibPeTot]; ExportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ExportItem[i]);ExportItem[i]--;} #NEIBPEtot #NEIBPE #NODE 6 #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems C 05

106 Example: sq-sr.c (/6) Reading distributed local data files (sqm.*) /*!C !C DATA file!c !C===*/ sprintf(filename, "sqm.%d", MyRank); fp = fopen(filename, "r"); fscanf(fp, "%d", &NeibPeTot); NeibPe = calloc(neibpetot, sizeof(int)); ImportIndex = calloc(+neibpetot, sizeof(int)); ExportIndex np Number = of calloc(+neibpetot, all meshes (internal sizeof(int)); + external) n Number of internal meshes for(neib=0;neib<neibpetot;neib++){ fscanf(fp, "%d", &NeibPe[neib]); } fscanf(fp, "%d %d", &np, &n); for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ImportIndex[neib]);} nn = ImportIndex[NeibPeTot]; ImportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ImportItem[i]); ImportItem[i]--;} for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ExportIndex[neib]);} nn = ExportIndex[NeibPeTot]; ExportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ExportItem[i]);ExportItem[i]--;} #NEIBPEtot #NEIBPE #NODE 6 #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems C 06

107 Example: sq-sr.c (/6) Reading distributed local data files (sqm.*) /*!C !C DATA file!c !C===*/ sprintf(filename, "sqm.%d", MyRank); fp = fopen(filename, "r"); fscanf(fp, "%d", &NeibPeTot); NeibPe = calloc(neibpetot, sizeof(int)); ImportIndex = calloc(+neibpetot, sizeof(int)); ExportIndex = calloc(+neibpetot, sizeof(int)); for(neib=0;neib<neibpetot;neib++){ fscanf(fp, "%d", &NeibPe[neib]); } fscanf(fp, "%d %d", &np, &n); for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ImportIndex[neib]);} nn = ImportIndex[NeibPeTot]; ImportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ImportItem[i]); ImportItem[i]--;} for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ExportIndex[neib]);} nn = ExportIndex[NeibPeTot]; ExportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ExportItem[i]);ExportItem[i]--;} #NEIBPEtot #NEIBPE #NODE 6 #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems C 07

108 Example: sq-sr.c (/6) Reading distributed local data files (sqm.*) /*!C !C DATA file!c !C===*/ sprintf(filename, "sqm.%d", MyRank); fp = fopen(filename, "r"); fscanf(fp, "%d", &NeibPeTot); NeibPe = calloc(neibpetot, sizeof(int)); ImportIndex = calloc(+neibpetot, sizeof(int)); ExportIndex = calloc(+neibpetot, sizeof(int)); for(neib=0;neib<neibpetot;neib++){ fscanf(fp, "%d", &NeibPe[neib]); } fscanf(fp, "%d %d", &np, &n); for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ImportIndex[neib]);} nn = ImportIndex[NeibPeTot]; ImportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ImportItem[i]); ImportItem[i]--;} for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ExportIndex[neib]);} nn = ExportIndex[NeibPeTot]; ExportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ExportItem[i]);ExportItem[i]--;} #NEIBPEtot #NEIBPE #NODE 6 #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems C 0

109 09 RECV/Import: PE#0 #NEIBPEtot #NEIBPE #NODE 6 #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems PE#0 PE# PE#

110 Example: sq-sr.c (/6) Reading distributed local data files (sqm.*) /*!C !C DATA file!c !C===*/ sprintf(filename, "sqm.%d", MyRank); fp = fopen(filename, "r"); fscanf(fp, "%d", &NeibPeTot); NeibPe = calloc(neibpetot, sizeof(int)); ImportIndex = calloc(+neibpetot, sizeof(int)); ExportIndex = calloc(+neibpetot, sizeof(int)); for(neib=0;neib<neibpetot;neib++){ fscanf(fp, "%d", &NeibPe[neib]); } fscanf(fp, "%d %d", &np, &n); for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ImportIndex[neib]);} nn = ImportIndex[NeibPeTot]; ImportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ImportItem[i]); ImportItem[i]--;} for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ExportIndex[neib]);} nn = ExportIndex[NeibPeTot]; ExportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ExportItem[i]);ExportItem[i]--;} #NEIBPEtot #NEIBPE #NODE 6 #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems C 0

111 Example: sq-sr.c (/6) Reading distributed local data files (sqm.*) /*!C !C DATA file!c !C===*/ sprintf(filename, "sqm.%d", MyRank); fp = fopen(filename, "r"); fscanf(fp, "%d", &NeibPeTot); NeibPe = calloc(neibpetot, sizeof(int)); ImportIndex = calloc(+neibpetot, sizeof(int)); ExportIndex = calloc(+neibpetot, sizeof(int)); for(neib=0;neib<neibpetot;neib++){ fscanf(fp, "%d", &NeibPe[neib]); } fscanf(fp, "%d %d", &np, &n); for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ImportIndex[neib]);} nn = ImportIndex[NeibPeTot]; ImportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ImportItem[i]); ImportItem[i]--;} for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ExportIndex[neib]);} nn = ExportIndex[NeibPeTot]; ExportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ExportItem[i]);ExportItem[i]--;} #NEIBPEtot #NEIBPE #NODE 6 #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems C

112 SEND/Export: PE#0 #NEIBPEtot #NEIBPE #NODE 6 #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems PE#0 PE# PE#

113 Example: sq-sr.c (3/6) Reading distributed local data files (sq.*) C sprintf(FileName, "sq.%d", MyRank); fp = fopen(FileName, "r"); assert(fp != NULL); val = calloc(np, sizeof(*val)); for(i=0;i<n;i++){ fscanf(fp, "%d", &val[i]); } n: number of internal points; val: global ID of meshes; val at external points is unknown at this stage.

114 Example: sq-sr.c (4/6) Preparation of sending/receiving buffers C

/*!C
!C-- BUFFER
!C===*/
SendBuf = calloc(ExportIndex[NeibPeTot], sizeof(*SendBuf));
RecvBuf = calloc(ImportIndex[NeibPeTot], sizeof(*RecvBuf));

for(neib=0;neib<NeibPeTot;neib++){
  istart = ExportIndex[neib];
  iend   = ExportIndex[neib+1];
  for(i=istart;i<iend;i++){
    SendBuf[i] = val[ExportItem[i]];
  }
}

Info. of boundary points is written into the sending buffer (SendBuf). Info. sent to NeibPe[neib] is stored in SendBuf[ExportIndex[neib] : ExportIndex[neib+1]-1].

115 Sending Buffer is nice... C for (neib=0; neib<NeibPETot; neib++){ tag= 0; iS_e= export_index[neib]; iE_e= export_index[neib+1]; BUFlength_e= iE_e - iS_e; ierr= MPI_Isend (&SendBuf[iS_e], BUFlength_e, MPI_DOUBLE, NeibPE[neib], 0, MPI_COMM_WORLD, &ReqSend[neib]); } The numbering of the boundary nodes themselves is not continuous, therefore the above procedure of MPI_Isend (starting address of a buffer, XX messages from that address) cannot be applied to them directly; copying them into a contiguous sending buffer solves this.

116 6 Communication Pattern using D Structure halo halo halo halo Dr. Osni Marques (Lawrence Berkeley National Laboratory)

117 Example: sq-sr.c (5/6) SEND/Export: MPI_Isend C

/*!C
!C-- SEND-RECV
!C===*/
StatSend    = malloc(sizeof(MPI_Status)  * NeibPeTot);
StatRecv    = malloc(sizeof(MPI_Status)  * NeibPeTot);
RequestSend = malloc(sizeof(MPI_Request) * NeibPeTot);
RequestRecv = malloc(sizeof(MPI_Request) * NeibPeTot);

for(neib=0;neib<NeibPeTot;neib++){
  istart = ExportIndex[neib];
  iend   = ExportIndex[neib+1];
  BufLength = iend - istart;

  MPI_Isend(&SendBuf[istart], BufLength, MPI_INT,
            NeibPe[neib], 0, MPI_COMM_WORLD, &RequestSend[neib]);
}

for(neib=0;neib<NeibPeTot;neib++){
  istart = ImportIndex[neib];
  iend   = ImportIndex[neib+1];
  BufLength = iend - istart;

  MPI_Irecv(&RecvBuf[istart], BufLength, MPI_INT,
            NeibPe[neib], 0, MPI_COMM_WORLD, &RequestRecv[neib]);
}

118 SEND/Export: PE#0 #NEIBPEtot #NEIBPE #NODE 6 #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems PE#0 PE# PE#

119 SEND: MPI_Isend/Irecv/Waitall C 9 SendBuf neib#0 neib# neib# neib# BUFlength_e BUFlength_e BUFlength_e BUFlength_e export_index[0] export_index[] export_index[] export_index[] export_index[] export_item (export_index[neib]:export_index[neib+]-) are sent to neib-th neighbor for (neib=0; neib<neibpetot;neib++){ for (k=export_index[neib];k<export_index[neib+];k++){ kk= export_item[k]; SendBuf[k]= VAL[kk]; } } for (neib=0; neib<neibpetot; neib++){ tag= 0; is_e= export_index[neib]; ie_e= export_index[neib+]; BUFlength_e= ie_e - is_e Copied to sending buffers } ierr= MPI_Isend (&SendBuf[iS_e], BUFlength_e, MPI_DOUBLE, NeibPE[neib], 0, MPI_COMM_WORLD, &ReqSend[neib]) MPI_Waitall(NeibPETot, ReqSend, StatSend);

120 MPI_Waitall C MPI_Waitall blocks until all comm s, associated with request in the array, complete. It is used for synchronizing MPI_Isend and MPI_Irecv in this class. At sending phase, contents of sending buffer cannot be modified before calling corresponding MPI_Waitall. At receiving phase, contents of receiving buffer cannot be used before calling corresponding MPI_Waitall. MPI_Isend and MPI_Irecv can be synchronized simultaneously with a single MPI_Waitall if it is consitent. Same request should be used in MPI_Isend and MPI_Irecv. Its operation is similar to that of MPI_Barrier but, MPI_Waitall can not be replaced by MPI_Barrier. Possible troubles using MPI_Barrier instead of MPI_Waitall: Contents of request and status are not updated properly, very slow operations etc. MPI_Waitall (count,request,status) count int I number of processes to be synchronized request MPI_Request I/O comm. request used in MPI_Waitall (array size: count) status MPI_Status O array of status objects MPI_STATUS_SIZE: defined in mpif.h, mpi.h 0

121 Notice: Send/Recv Arrays #PE0 send: VEC(start_send)~VEC(start_send+length_send-1) #PE1 send: VEC(start_send)~VEC(start_send+length_send-1) #PE0 recv: VEC(start_recv)~VEC(start_recv+length_recv-1) #PE1 recv: VEC(start_recv)~VEC(start_recv+length_recv-1) length_send of the sending process must be equal to length_recv of the receiving process (PE#0 to PE#1, PE#1 to PE#0). sendbuf and recvbuf: different addresses.

122 Relationship SEND/RECV do neib=, NEIBPETOT is_e= export_index(neib-) + ie_e= export_index(neib ) BUFlength_e= ie_e + - is_e call MPI_ISEND & & (SENDbuf(iS_e), BUFlength_e, MPI_INTEGER, NEIBPE(neib), 0,& & MPI_COMM_WORLD, request_send(neib), ierr) enddo do neib=, NEIBPETOT is_i= import_index(neib-) + ie_i= import_index(neib ) BUFlength_i= ie_i + - is_i call MPI_IRECV & & (RECVbuf(iS_i), BUFlength_i, MPI_INTEGER, NEIBPE(neib), 0,& & MPI_COMM_WORLD, request_recv(neib), ierr) enddo Consistency of ID s of sources/destinations, size and contents of messages! Communication occurs when NEIBPE(neib) matches

123 Relationship SEND/RECV (#0 to #) # # # Send #0 Recv. # #5 #0 #9 NEIBPE(:)=,,5,9 #0 NEIBPE(:)=,0,0 Consistency of ID s of sources/destinations, size and contents of messages! Communication occurs when NEIBPE(neib) matches

124 Example: sq-sr.c (5/6) RECV/Import: MPI_Irecv C /*!C!C !C SEND-RECV!C !C===*/ StatSend = malloc(sizeof(mpi_status) * NeibPeTot); StatRecv = malloc(sizeof(mpi_status) * NeibPeTot); RequestSend = malloc(sizeof(mpi_request) * NeibPeTot); RequestRecv = malloc(sizeof(mpi_request) * NeibPeTot); for(neib=0;neib<neibpetot;neib++){ istart = ExportIndex[neib]; iend = ExportIndex[neib+]; BufLength = iend - istart; MPI_Isend(&SendBuf[iStart], BufLength, MPI_INT, PE#0 PE# NeibPe[neib], 0, MPI_COMM_WORLD, &RequestSend[neib]); } for(neib=0;neib<neibpetot;neib++){ istart = ImportIndex[neib]; iend = ImportIndex[neib+]; BufLength = iend - istart; PE# PE# } MPI_Irecv(&RecvBuf[iStart], BufLength, MPI_INT, NeibPe[neib], 0, MPI_COMM_WORLD, &RequestRecv[neib]);

125 5 RECV/Import: PE#0 #NEIBPEtot #NEIBPE #NODE 6 #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems PE#0 PE# PE#

126 RECV: MPI_Isend/Irecv/Waitall for (neib=0; neib<neibpetot; neib++){ tag= 0; is_i= import_index[neib]; ie_i= import_index[neib+]; BUFlength_i= ie_i - is_i C 6 } ierr= MPI_Irecv (&RecvBuf[iS_i], BUFlength_i, MPI_DOUBLE, NeibPE[neib], 0, MPI_COMM_WORLD, &ReqRecv[neib]) MPI_Waitall(NeibPETot, ReqRecv, StatRecv); for (neib=0; neib<neibpetot;neib++){ for (k=import_index[neib];k<import_index[neib+];k++){ kk= import_item[k]; VAL[kk]= RecvBuf[k]; } } Copied from receiving buffer import_item (import_index[neib]:import_index[neib+]-) are received from neib-th neighbor RecvBuf neib#0 neib# neib# neib# BUFlength_i BUFlength_i BUFlength_i BUFlength_i import_index[0] import_index[] import_index[] import_index[] import_index[]

127 Example: sq-sr.c (6/6) Reading info of ext pts from receiving buffers C

MPI_Waitall(NeibPeTot, RequestRecv, StatRecv);
for(neib=0;neib<NeibPeTot;neib++){
  istart = ImportIndex[neib];
  iend   = ImportIndex[neib+1];
  for(i=istart;i<iend;i++){
    val[ImportItem[i]] = RecvBuf[i];
  }
}
MPI_Waitall(NeibPeTot, RequestSend, StatSend);

/* Contents of RecvBuf are copied to the values at external points. */

/*!C
!C-- OUTPUT
!C===*/
for(neib=0;neib<NeibPeTot;neib++){
  istart = ImportIndex[neib];
  iend   = ImportIndex[neib+1];
  for(i=istart;i<iend;i++){
    int in = ImportItem[i];
    printf("RECVbuf %d %d %d\n", MyRank, NeibPe[neib], val[in]);
  }
}

MPI_Finalize();
return 0;
}

128 Example: sq-sr.c (6/6) Writing values at external points C MPI_Waitall(NeibPeTot, RequestRecv, StatRecv); for(neib=0;neib<neibpetot;neib++){ istart = ImportIndex[neib]; iend = ImportIndex[neib+]; for(i=istart;i<iend;i++){ val[importitem[i]] = RecvBuf[i]; } } MPI_Waitall(NeibPeTot, RequestSend, StatSend); /*!C !C OUTPUT!C !C===*/ for(neib=0;neib<neibpetot;neib++){ istart = ImportIndex[neib]; iend = ImportIndex[neib+]; for(i=istart;i<iend;i++){ int in = ImportItem[i]; printf("recvbuf%d%d%d n", MyRank, NeibPe[neib], val[in]); } } MPI_Finalize(); } return 0;

129 Results (PE#0) (program output: lines of the form "RECVbuf <MyRank> <neighbor> <value>" for each external point)

130 Results (PE#1) (same output, PE#1 highlighted)

131 Results (PE#2) (same output, PE#2 highlighted)

132 Results (PE#3) (same output, PE#3 highlighted)

133 Distributed Local Data Structure for Parallel Computation A distributed local data structure for domain-to-domain communication has been introduced, which is appropriate for applications with sparse coefficient matrices (e.g. FDM, FEM, FVM). SPMD Local numbering: internal pts first, then external pts Generalized communication table Everything is easy if the proper data structure is defined: values at boundary pts are copied into sending buffers, Send/Recv, values at external pts are updated through receiving buffers.

134 This Idea can be applied to FEM with more complicated geometries. Red Lacquered Gate in 6 Pes, 0,6 elements k-metis Load Balance=.0 edgecut = 7,56 p-metis Load Balance=.00 edgecut = 7,7

135 Initial Mesh Exercise

136 #PE Three Domains 5 Exercise #PE #PE

137 Three Domains #PE #PE0 #PE Exercise 7

138 PE#0: sqm.0: fill s #PE #PE0 #PE #PE #PE0 #PE #NEIBPEtot #NEIBPE #NODE (int+ext, int pts) #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems Exercise

139 PE#: sqm.: fill s #PE #PE0 #PE #PE #PE0 #PE #NEIBPEtot #NEIBPE 0 #NODE (int+ext, int pts) #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems Exercise 9

140 PE#: sqm.: fill s #PE #PE0 #PE #PE #PE0 #PE #NEIBPEtot #NEIBPE 0 #NODE 9 5 (int+ext, int pts) #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems Exercise 0

141 #PE #PE0 #PE Exercise

142 Procedures Exercise Number of Internal/External Points Where do External Pts come from? IMPORTindex,IMPORTitems Sequence of NEIBPE Then check destinations of Boundary Pts. EXPORTindex,EXPORTitems Sequence of NEIBPE sq.* are in <$O-S>/ex Create sqm.* by yourself copy <$O-S>/a.out (by sq-sr.c) to <$O-S>/ex pjsub go.sh

What is Point-to-Point Comm.?

What is Point-to-Point Comm.? 0 What is Point-to-Point Comm.? Collective Communication MPI_Reduce, MPI_Scatter/Gather etc. Communications with all processes in the communicator Application Area BEM, Spectral Method, MD: global interactions

More information

Report S1 C. Kengo Nakajima Information Technology Center. Technical & Scientific Computing II ( ) Seminar on Computer Science II ( )

Report S1 C. Kengo Nakajima Information Technology Center. Technical & Scientific Computing II ( ) Seminar on Computer Science II ( ) Report S1 C Kengo Nakajima Information Technology Center Technical & Scientific Computing II (4820-1028) Seminar on Computer Science II (4810-1205) Problem S1-3 Report S1 (2/2) Develop parallel program

More information

Report S2 C. Kengo Nakajima Information Technology Center. Technical & Scientific Computing II ( ) Seminar on Computer Science II ( )

Report S2 C. Kengo Nakajima Information Technology Center. Technical & Scientific Computing II ( ) Seminar on Computer Science II ( ) Report S2 C Kengo Nakajima Information Technology Center Technical & Scientific Computing II (4820-1028) Seminar on Computer Science II (4810-1205) S2-ref 2 Overview Distributed Local Data Program Results

More information

Introduction to Parallel Programming for Multicore/Manycore Clusters Part II-3: Parallel FVM using MPI

Introduction to Parallel Programming for Multicore/Manycore Clusters Part II-3: Parallel FVM using MPI Introduction to Parallel Programming for Multi/Many Clusters Part II-3: Parallel FVM using MPI Kengo Nakajima Information Technology Center The University of Tokyo 2 Overview Introduction Local Data Structure

More information

Ex.2: Send-Recv an Array (1/4)

Ex.2: Send-Recv an Array (1/4) MPI-t1 1 Ex.2: Send-Recv an Array (1/4) Exchange VE (real, 8-byte) between PE#0 & PE#1 PE#0 to PE#1 PE#0: send VE[0]-VE[10] (length=11) PE#1: recv. as VE[25]-VE[35] (length=11) PE#1 to PE#0 PE#1: send

More information

Report S1 C. Kengo Nakajima. Programming for Parallel Computing ( ) Seminar on Advanced Computing ( )

Report S1 C. Kengo Nakajima. Programming for Parallel Computing ( ) Seminar on Advanced Computing ( ) Report S1 C Kengo Nakajima Programming for Parallel Computing (616-2057) Seminar on Advanced Computing (616-4009) Problem S1-1 Report S1 (1/2) Read local files /a1.0~a1.3, /a2.0~a2.3. Develop

More information

Report S1 C. Kengo Nakajima

Report S1 C. Kengo Nakajima Report S1 C Kengo Nakajima Technical & Scientific Computing II (4820-1028) Seminar on Computer Science II (4810-1205) Hybrid Distributed Parallel Computing (3747-111) Problem S1-1 Report S1 Read local

More information

Report S2 C. Kengo Nakajima. Programming for Parallel Computing ( ) Seminar on Advanced Computing ( )

Report S2 C. Kengo Nakajima. Programming for Parallel Computing ( ) Seminar on Advanced Computing ( ) Report S2 C Kengo Nakajima Programming for Parallel Computing (616-2057) Seminar on Advanced Computing (616-4009) S2-ref 2 Overview Distributed Local Data Program Results FEM1D 3 1D Steady State Heat Conduction

More information

Report S1 Fortran. Kengo Nakajima Information Technology Center

Report S1 Fortran. Kengo Nakajima Information Technology Center Report S1 Fortran Kengo Nakajima Information Technology Center Technical & Scientific Computing II (4820-1028) Seminar on Computer Science II (4810-1205) Problem S1-1 Report S1 Read local files /a1.0~a1.3,

More information

Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in C

Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in C Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in C Kengo Nakajima Technical & Scientific Computing II (4820-1028) Seminar on Computer Science II (4810-1205) Hybrid Distributed Parallel

More information

MPI. (message passing, MIMD)

MPI. (message passing, MIMD) MPI (message passing, MIMD) What is MPI? a message-passing library specification extension of C/C++ (and Fortran) message passing for distributed memory parallel programming Features of MPI Point-to-point

More information

Introduction to MPI. Ricardo Fonseca. https://sites.google.com/view/rafonseca2017/

Introduction to MPI. Ricardo Fonseca. https://sites.google.com/view/rafonseca2017/ Introduction to MPI Ricardo Fonseca https://sites.google.com/view/rafonseca2017/ Outline Distributed Memory Programming (MPI) Message Passing Model Initializing and terminating programs Point to point

More information

Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in C

Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in C Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in C Kengo Nakajima Programming for Parallel Computing (616-2057) Seminar on Advanced Computing (616-4009) 1 Motivation for Parallel Computing

More information

CS 470 Spring Mike Lam, Professor. Distributed Programming & MPI

CS 470 Spring Mike Lam, Professor. Distributed Programming & MPI CS 470 Spring 2017 Mike Lam, Professor Distributed Programming & MPI MPI paradigm Single program, multiple data (SPMD) One program, multiple processes (ranks) Processes communicate via messages An MPI

More information

Message Passing Interface

Message Passing Interface MPSoC Architectures MPI Alberto Bosio, Associate Professor UM Microelectronic Departement bosio@lirmm.fr Message Passing Interface API for distributed-memory programming parallel code that runs across

More information

Introduction to Parallel Programming for Multicore/Manycore Clusters Part II-5: Communication-Computation Overlapping and Some Other Issues

Introduction to Parallel Programming for Multicore/Manycore Clusters Part II-5: Communication-Computation Overlapping and Some Other Issues Introduction to Parallel Programming for Multicore/Manycore Clusters Part II-5: Communication-Computation Overlapping and Some Other Issues Kengo Nakajima Information Technology Center The University of

More information

CS 470 Spring Mike Lam, Professor. Distributed Programming & MPI

CS 470 Spring Mike Lam, Professor. Distributed Programming & MPI CS 470 Spring 2018 Mike Lam, Professor Distributed Programming & MPI MPI paradigm Single program, multiple data (SPMD) One program, multiple processes (ranks) Processes communicate via messages An MPI

More information

CS 179: GPU Programming. Lecture 14: Inter-process Communication

CS 179: GPU Programming. Lecture 14: Inter-process Communication CS 179: GPU Programming Lecture 14: Inter-process Communication The Problem What if we want to use GPUs across a distributed system? GPU cluster, CSIRO Distributed System A collection of computers Each

More information

Recap of Parallelism & MPI

Recap of Parallelism & MPI Recap of Parallelism & MPI Chris Brady Heather Ratcliffe The Angry Penguin, used under creative commons licence from Swantje Hess and Jannis Pohlmann. Warwick RSE 13/12/2017 Parallel programming Break

More information

The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing

The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Parallelism Decompose the execution into several tasks according to the work to be done: Function/Task

More information

Introduction to the Message Passing Interface (MPI)

Introduction to the Message Passing Interface (MPI) Introduction to the Message Passing Interface (MPI) CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction to the Message Passing Interface (MPI) Spring 2018

More information

Message Passing Interface

Message Passing Interface Message Passing Interface DPHPC15 TA: Salvatore Di Girolamo DSM (Distributed Shared Memory) Message Passing MPI (Message Passing Interface) A message passing specification implemented

More information

Communication-Computation. 3D Parallel FEM (V) Overlapping. Kengo Nakajima

Communication-Computation. 3D Parallel FEM (V) Overlapping. Kengo Nakajima 3D Parallel FEM (V) Communication-Computation Overlapping Kengo Nakajima Programming for Parallel Computing (616-2057) Seminar on Advanced Computing (616-4009) U y =0 @ y=y min (N z -1) elements N z nodes

More information

Message Passing Interface. most of the slides taken from Hanjun Kim

Message Passing Interface. most of the slides taken from Hanjun Kim Message Passing Interface most of the slides taken from Hanjun Kim Message Passing Pros Scalable, Flexible Cons Someone says it s more difficult than DSM MPI (Message Passing Interface) A standard message

More information

Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in C

Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in C Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in C Kengo Nakajima Programming for Parallel Computing (616-2057) Seminar on Advanced Computing (616-4009) 1 Motivation for Parallel Computing

More information

First day. Basics of parallel programming. RIKEN CCS HPC Summer School Hiroya Matsuba, RIKEN CCS

First day. Basics of parallel programming. RIKEN CCS HPC Summer School Hiroya Matsuba, RIKEN CCS First day Basics of parallel programming RIKEN CCS HPC Summer School Hiroya Matsuba, RIKEN CCS Today s schedule: Basics of parallel programming 7/22 AM: Lecture Goals Understand the design of typical parallel

More information

HPC Parallel Programing Multi-node Computation with MPI - I

HPC Parallel Programing Multi-node Computation with MPI - I HPC Parallel Programing Multi-node Computation with MPI - I Parallelization and Optimization Group TATA Consultancy Services, Sahyadri Park Pune, India TCS all rights reserved April 29, 2013 Copyright

More information

3D Parallel FEM (V) Communication-Computation Overlapping

3D Parallel FEM (V) Communication-Computation Overlapping 3D Parallel FEM (V) Communication-Computation Overlapping Kengo Nakajima Programming for Parallel Computing (616-2057) Seminar on Advanced Computing (616-4009) 2 z Uniform Distributed Force in z-direction

More information

Department of Informatics V. HPC-Lab. Session 4: MPI, CG M. Bader, A. Breuer. Alex Breuer

Department of Informatics V. HPC-Lab. Session 4: MPI, CG M. Bader, A. Breuer. Alex Breuer HPC-Lab Session 4: MPI, CG M. Bader, A. Breuer Meetings Date Schedule 10/13/14 Kickoff 10/20/14 Q&A 10/27/14 Presentation 1 11/03/14 H. Bast, Intel 11/10/14 Presentation 2 12/01/14 Presentation 3 12/08/14

More information

Outline. Communication modes MPI Message Passing Interface Standard

Outline. Communication modes MPI Message Passing Interface Standard MPI THOAI NAM Outline Communication modes MPI Message Passing Interface Standard TERMs (1) Blocking If return from the procedure indicates the user is allowed to reuse resources specified in the call Non-blocking

More information

MPI Message Passing Interface

MPI Message Passing Interface MPI Message Passing Interface Portable Parallel Programs Parallel Computing A problem is broken down into tasks, performed by separate workers or processes Processes interact by exchanging information

More information

IPM Workshop on High Performance Computing (HPC08) IPM School of Physics Workshop on High Perfomance Computing/HPC08

IPM Workshop on High Performance Computing (HPC08) IPM School of Physics Workshop on High Perfomance Computing/HPC08 IPM School of Physics Workshop on High Perfomance Computing/HPC08 16-21 February 2008 MPI tutorial Luca Heltai Stefano Cozzini Democritos/INFM + SISSA 1 When

More information

CSE 613: Parallel Programming. Lecture 21 ( The Message Passing Interface )

CSE 613: Parallel Programming. Lecture 21 ( The Message Passing Interface ) CSE 613: Parallel Programming Lecture 21 ( The Message Passing Interface ) Jesmin Jahan Tithi Department of Computer Science SUNY Stony Brook Fall 2013 ( Slides from Rezaul A. Chowdhury ) Principles of

More information

Introduction to MPI. May 20, Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign

Introduction to MPI. May 20, Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign Introduction to MPI May 20, 2013 Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign Top500.org PERFORMANCE DEVELOPMENT 1 Eflop/s 162 Pflop/s PROJECTED 100 Pflop/s

More information

Experiencing Cluster Computing Message Passing Interface

Experiencing Cluster Computing Message Passing Interface Experiencing Cluster Computing Message Passing Interface Class 6 Message Passing Paradigm The Underlying Principle A parallel program consists of p processes with different address spaces. Communication

More information

MPI, Part 3. Scientific Computing Course, Part 3

MPI, Part 3. Scientific Computing Course, Part 3 MPI, Part 3 Scientific Computing Course, Part 3 Non-blocking communications Diffusion: Had to Global Domain wait for communications to compute Could not compute end points without guardcell data All work

More information

Outline. Communication modes MPI Message Passing Interface Standard. Khoa Coâng Ngheä Thoâng Tin Ñaïi Hoïc Baùch Khoa Tp.HCM

Outline. Communication modes MPI Message Passing Interface Standard. Khoa Coâng Ngheä Thoâng Tin Ñaïi Hoïc Baùch Khoa Tp.HCM THOAI NAM Outline Communication modes MPI Message Passing Interface Standard TERMs (1) Blocking If return from the procedure indicates the user is allowed to reuse resources specified in the call Non-blocking

More information

Distributed Memory Parallel Programming

Distributed Memory Parallel Programming COSC Big Data Analytics Parallel Programming using MPI Edgar Gabriel Spring 201 Distributed Memory Parallel Programming Vast majority of clusters are homogeneous Necessitated by the complexity of maintaining

More information

Collective Communications I

Collective Communications I Collective Communications I Ned Nedialkov McMaster University Canada CS/SE 4F03 January 2016 Outline Introduction Broadcast Reduce c 2013 16 Ned Nedialkov 2/14 Introduction A collective communication involves

More information

MPI, Part 2. Scientific Computing Course, Part 3

MPI, Part 2. Scientific Computing Course, Part 3 MPI, Part 2 Scientific Computing Course, Part 3 Something new: Sendrecv A blocking send and receive built in together Lets them happen simultaneously Can automatically pair the sends/recvs! dest, source

More information

MPI: Parallel Programming for Extreme Machines. Si Hammond, High Performance Systems Group

MPI: Parallel Programming for Extreme Machines. Si Hammond, High Performance Systems Group MPI: Parallel Programming for Extreme Machines Si Hammond, High Performance Systems Group Quick Introduction Si Hammond, (sdh@dcs.warwick.ac.uk) WPRF/PhD Research student, High Performance Systems Group,

More information

Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in Fortran

Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in Fortran Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in Fortran Kengo Nakajima Programming for Parallel Computing (616-2057) Seminar on Advanced Computing (616-4009) 1 Motivation for Parallel

More information

Programming Scalable Systems with MPI. Clemens Grelck, University of Amsterdam

Programming Scalable Systems with MPI. Clemens Grelck, University of Amsterdam Clemens Grelck University of Amsterdam UvA / SurfSARA High Performance Computing and Big Data Course June 2014 Parallel Programming with Compiler Directives: OpenMP Message Passing Gentle Introduction

More information

Cluster Computing MPI. Industrial Standard Message Passing

Cluster Computing MPI. Industrial Standard Message Passing MPI Industrial Standard Message Passing MPI Features Industrial Standard Highly portable Widely available SPMD programming model Synchronous execution MPI Outer scope int MPI_Init( int *argc, char ** argv)

More information

Distributed Systems + Middleware Advanced Message Passing with MPI

Distributed Systems + Middleware Advanced Message Passing with MPI Distributed Systems + Middleware Advanced Message Passing with MPI Gianpaolo Cugola Dipartimento di Elettronica e Informazione Politecnico, Italy cugola@elet.polimi.it http://home.dei.polimi.it/cugola

More information

Introduction to MPI: Part II

Introduction to MPI: Part II Introduction to MPI: Part II Pawel Pomorski, University of Waterloo, SHARCNET ppomorsk@sharcnetca November 25, 2015 Summary of Part I: To write working MPI (Message Passing Interface) parallel programs

More information

Scientific Computing

Scientific Computing Lecture on Scientific Computing Dr. Kersten Schmidt Lecture 21 Technische Universität Berlin Institut für Mathematik Wintersemester 2014/2015 Syllabus Linear Regression, Fast Fourier transform Modelling

More information

MPI point-to-point communication

MPI point-to-point communication MPI point-to-point communication Slides Sebastian von Alfthan CSC Tieteen tietotekniikan keskus Oy CSC IT Center for Science Ltd. Introduction MPI processes are independent, they communicate to coordinate

More information

東京大学情報基盤中心准教授片桐孝洋 Takahiro Katagiri, Associate Professor, Information Technology Center, The University of Tokyo

東京大学情報基盤中心准教授片桐孝洋 Takahiro Katagiri, Associate Professor, Information Technology Center, The University of Tokyo Overview of MPI 東京大学情報基盤中心准教授片桐孝洋 Takahiro Katagiri, Associate Professor, Information Technology Center, The University of Tokyo 台大数学科学中心科学計算冬季学校 1 Agenda 1. Features of MPI 2. Basic MPI Functions 3. Reduction

More information

Basic MPI Communications. Basic MPI Communications (cont d)

Basic MPI Communications. Basic MPI Communications (cont d) Basic MPI Communications MPI provides two non-blocking routines: MPI_Isend(buf,cnt,type,dst,tag,comm,reqHandle) buf: source of data to be sent cnt: number of data elements to be sent type: type of each

More information

Collective Communication in MPI and Advanced Features

Collective Communication in MPI and Advanced Features Collective Communication in MPI and Advanced Features Pacheco s book. Chapter 3 T. Yang, CS240A. Part of slides from the text book, CS267 K. Yelick from UC Berkeley and B. Gropp, ANL Outline Collective

More information

Practical Introduction to Message-Passing Interface (MPI)

Practical Introduction to Message-Passing Interface (MPI) 1 Practical Introduction to Message-Passing Interface (MPI) October 1st, 2015 By: Pier-Luc St-Onge Partners and Sponsors 2 Setup for the workshop 1. Get a user ID and password paper (provided in class):

More information

lslogin3$ cd lslogin3$ tar -xvf ~train00/mpibasic_lab.tar cd mpibasic_lab/pi cd mpibasic_lab/decomp1d

lslogin3$ cd lslogin3$ tar -xvf ~train00/mpibasic_lab.tar cd mpibasic_lab/pi cd mpibasic_lab/decomp1d MPI Lab Getting Started Login to ranger.tacc.utexas.edu Untar the lab source code lslogin3$ cd lslogin3$ tar -xvf ~train00/mpibasic_lab.tar Part 1: Getting Started with simple parallel coding hello mpi-world

More information

Slides prepared by : Farzana Rahman 1

Slides prepared by : Farzana Rahman 1 Introduction to MPI 1 Background on MPI MPI - Message Passing Interface Library standard defined by a committee of vendors, implementers, and parallel programmers Used to create parallel programs based

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 4 Message-Passing Programming Learning Objectives n Understanding how MPI programs execute n Familiarity with fundamental MPI functions

More information

AMath 483/583 Lecture 21

AMath 483/583 Lecture 21 AMath 483/583 Lecture 21 Outline: Review MPI, reduce and bcast MPI send and receive Master Worker paradigm References: $UWHPSC/codes/mpi class notes: MPI section class notes: MPI section of bibliography

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalabissu-tokyoacjp/~reiji/pna16/ [ 5 ] MPI: Message Passing Interface Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1 Architecture

More information

Parallel hardware. Distributed Memory. Parallel software. COMP528 MPI Programming, I. Flynn s taxonomy:

Parallel hardware. Distributed Memory. Parallel software. COMP528 MPI Programming, I. Flynn s taxonomy: COMP528 MPI Programming, I www.csc.liv.ac.uk/~alexei/comp528 Alexei Lisitsa Dept of computer science University of Liverpool a.lisitsa@.liverpool.ac.uk Flynn s taxonomy: Parallel hardware SISD (Single

More information

MPI Lab. How to split a problem across multiple processors Broadcasting input to other nodes Using MPI_Reduce to accumulate partial sums

MPI Lab. How to split a problem across multiple processors Broadcasting input to other nodes Using MPI_Reduce to accumulate partial sums MPI Lab Parallelization (Calculating π in parallel) How to split a problem across multiple processors Broadcasting input to other nodes Using MPI_Reduce to accumulate partial sums Sharing Data Across Processors

More information

MPI MESSAGE PASSING INTERFACE

MPI MESSAGE PASSING INTERFACE MPI MESSAGE PASSING INTERFACE David COLIGNON CÉCI - Consortium des Équipements de Calcul Intensif http://hpc.montefiore.ulg.ac.be Outline Introduction From serial source code to parallel execution MPI

More information

MPI Tutorial. Shao-Ching Huang. IDRE High Performance Computing Workshop

MPI Tutorial. Shao-Ching Huang. IDRE High Performance Computing Workshop MPI Tutorial Shao-Ching Huang IDRE High Performance Computing Workshop 2013-02-13 Distributed Memory Each CPU has its own (local) memory This needs to be fast for parallel scalability (e.g. Infiniband,

More information

Message Passing Interface

Message Passing Interface Message Passing Interface by Kuan Lu 03.07.2012 Scientific researcher at Georg-August-Universität Göttingen and Gesellschaft für wissenschaftliche Datenverarbeitung mbh Göttingen Am Faßberg, 37077 Göttingen,

More information

CS 426. Building and Running a Parallel Application

CS 426. Building and Running a Parallel Application CS 426 Building and Running a Parallel Application 1 Task/Channel Model Design Efficient Parallel Programs (or Algorithms) Mainly for distributed memory systems (e.g. Clusters) Break Parallel Computations

More information

Introduction to parallel computing concepts and technics

Introduction to parallel computing concepts and technics Introduction to parallel computing concepts and technics Paschalis Korosoglou (support@grid.auth.gr) User and Application Support Unit Scientific Computing Center @ AUTH Overview of Parallel computing

More information

CS 470 Spring Mike Lam, Professor. Distributed Programming & MPI

CS 470 Spring Mike Lam, Professor. Distributed Programming & MPI CS 470 Spring 2019 Mike Lam, Professor Distributed Programming & MPI MPI paradigm Single program, multiple data (SPMD) One program, multiple processes (ranks) Processes communicate via messages An MPI

More information

Parallel Programming, MPI Lecture 2

Parallel Programming, MPI Lecture 2 Parallel Programming, MPI Lecture 2 Ehsan Nedaaee Oskoee 1 1 Department of Physics IASBS IPM Grid and HPC workshop IV, 2011 Outline 1 Introduction and Review The Von Neumann Computer Kinds of Parallel

More information

More about MPI programming. More about MPI programming p. 1

More about MPI programming. More about MPI programming p. 1 More about MPI programming More about MPI programming p. 1 Some recaps (1) One way of categorizing parallel computers is by looking at the memory configuration: In shared-memory systems, the CPUs share

More information

Practical Course Scientific Computing and Visualization

Practical Course Scientific Computing and Visualization July 5, 2006 Page 1 of 21 1. Parallelization Architecture our target architecture: MIMD distributed address space machines program1 data1 program2 data2 program program3 data data3.. program(data) program1(data1)

More information

Introduction to MPI. Jerome Vienne Texas Advanced Computing Center January 10 th,

Introduction to MPI. Jerome Vienne Texas Advanced Computing Center January 10 th, Introduction to MPI Jerome Vienne Texas Advanced Computing Center January 10 th, 2013 Email: viennej@tacc.utexas.edu 1 Course Objectives & Assumptions Objectives Teach basics of MPI-Programming Share information

More information

MA471. Lecture 5. Collective MPI Communication

MA471. Lecture 5. Collective MPI Communication MA471 Lecture 5 Collective MPI Communication Today: When all the processes want to send, receive or both Excellent website for MPI command syntax available at: http://www-unix.mcs.anl.gov/mpi/www/ 9/10/2003

More information

Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in Fortran

Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in Fortran Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in Fortran Kengo Nakajima Programming for Parallel Computing (616-2057) Seminar on Advanced Computing (616-4009) 1 Motivation for Parallel

More information

Parallel Programming. Using MPI (Message Passing Interface)

Parallel Programming. Using MPI (Message Passing Interface) Parallel Programming Using MPI (Message Passing Interface) Message Passing Model Simple implementation of the task/channel model Task Process Channel Message Suitable for a multicomputer Number of processes

More information

Overview of MPI 國家理論中心數學組 高效能計算 短期課程. 名古屋大学情報基盤中心教授片桐孝洋 Takahiro Katagiri, Professor, Information Technology Center, Nagoya University

Overview of MPI 國家理論中心數學組 高效能計算 短期課程. 名古屋大学情報基盤中心教授片桐孝洋 Takahiro Katagiri, Professor, Information Technology Center, Nagoya University Overview of MPI 名古屋大学情報基盤中心教授片桐孝洋 Takahiro Katagiri, Professor, Information Technology Center, Nagoya University 國家理論中心數學組 高效能計算 短期課程 1 Agenda 1. Features of MPI 2. Basic MPI Functions 3. Reduction Operations

More information

Parallel Programming using MPI. Supercomputing group CINECA

Parallel Programming using MPI. Supercomputing group CINECA Parallel Programming using MPI Supercomputing group CINECA Contents Programming with message passing Introduction to message passing and MPI Basic MPI programs MPI Communicators Send and Receive function

More information

Parallel Programming Using MPI

Parallel Programming Using MPI Parallel Programming Using MPI Short Course on HPC 15th February 2019 Aditya Krishna Swamy adityaks@iisc.ac.in SERC, Indian Institute of Science When Parallel Computing Helps? Want to speed up your calculation

More information

AMath 483/583 Lecture 18 May 6, 2011

AMath 483/583 Lecture 18 May 6, 2011 AMath 483/583 Lecture 18 May 6, 2011 Today: MPI concepts Communicators, broadcast, reduce Next week: MPI send and receive Iterative methods Read: Class notes and references $CLASSHG/codes/mpi MPI Message

More information

Parallel Computing. Distributed memory model MPI. Leopold Grinberg T. J. Watson IBM Research Center, USA. Instructor: Leopold Grinberg

Parallel Computing. Distributed memory model MPI. Leopold Grinberg T. J. Watson IBM Research Center, USA. Instructor: Leopold Grinberg Parallel Computing Distributed memory model MPI Leopold Grinberg T. J. Watson IBM Research Center, USA Why do we need to compute in parallel large problem size - memory constraints computation on a single

More information

MPI: The Message-Passing Interface. Most of this discussion is from [1] and [2].

MPI: The Message-Passing Interface. Most of this discussion is from [1] and [2]. MPI: The Message-Passing Interface Most of this discussion is from [1] and [2]. What Is MPI? The Message-Passing Interface (MPI) is a standard for expressing distributed parallelism via message passing.

More information

Programming Scalable Systems with MPI. UvA / SURFsara High Performance Computing and Big Data. Clemens Grelck, University of Amsterdam

Programming Scalable Systems with MPI. UvA / SURFsara High Performance Computing and Big Data. Clemens Grelck, University of Amsterdam Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data Message Passing as a Programming Paradigm Gentle Introduction to MPI Point-to-point Communication Message Passing

More information

Introduction to MPI. SuperComputing Applications and Innovation Department 1 / 143

Introduction to MPI. SuperComputing Applications and Innovation Department 1 / 143 Introduction to MPI Isabella Baccarelli - i.baccarelli@cineca.it Mariella Ippolito - m.ippolito@cineca.it Cristiano Padrin - c.padrin@cineca.it Vittorio Ruggiero - v.ruggiero@cineca.it SuperComputing Applications

More information

Programming with MPI Collectives

Programming with MPI Collectives Programming with MPI Collectives Jan Thorbecke Type to enter text Delft University of Technology Challenge the future Collectives Classes Communication types exercise: BroadcastBarrier Gather Scatter exercise:

More information

Topics. Lecture 7. Review. Other MPI collective functions. Collective Communication (cont d) MPI Programming (III)

Topics. Lecture 7. Review. Other MPI collective functions. Collective Communication (cont d) MPI Programming (III) Topics Lecture 7 MPI Programming (III) Collective communication (cont d) Point-to-point communication Basic point-to-point communication Non-blocking point-to-point communication Four modes of blocking

More information

CEE 618 Scientific Parallel Computing (Lecture 5): Message-Passing Interface (MPI) advanced

CEE 618 Scientific Parallel Computing (Lecture 5): Message-Passing Interface (MPI) advanced 1 / 32 CEE 618 Scientific Parallel Computing (Lecture 5): Message-Passing Interface (MPI) advanced Albert S. Kim Department of Civil and Environmental Engineering University of Hawai i at Manoa 2540 Dole

More information

Parallel programming with MPI Part I -Introduction and Point-to-Point Communications

Parallel programming with MPI Part I -Introduction and Point-to-Point Communications Parallel programming with MPI Part I -Introduction and Point-to-Point Communications A. Emerson, A. Marani, Supercomputing Applications and Innovation (SCAI), CINECA 23 February 2016 MPI course 2016 Contents

More information

2018/7/ :25-12:10

2018/7/ :25-12:10 2018/7/10 1 LU + 2018 7 10 10:25-12:10 2018/7/10 1. 4 10 ( ) 2. 4 17 l 3. 4 24 l 4. 5 1 l 5. 5 8 l ( ) 6. 5 15 l 2-2018 8 6 24 7. 5 22 l 8. 6 5 l - 9. 6 19 l 10. 7 3 1 ( 2) l 11. 7 3 2 l l 12. 7 10 l 13.

More information

A message contains a number of elements of some particular datatype. MPI datatypes:

A message contains a number of elements of some particular datatype. MPI datatypes: Messages Messages A message contains a number of elements of some particular datatype. MPI datatypes: Basic types. Derived types. Derived types can be built up from basic types. C types are different from

More information

Practical Scientific Computing: Performanceoptimized

Practical Scientific Computing: Performanceoptimized Practical Scientific Computing: Performanceoptimized Programming Programming with MPI November 29, 2006 Dr. Ralf-Peter Mundani Department of Computer Science Chair V Technische Universität München, Germany

More information

MPI MESSAGE PASSING INTERFACE

MPI MESSAGE PASSING INTERFACE MPI MESSAGE PASSING INTERFACE David COLIGNON, ULiège CÉCI - Consortium des Équipements de Calcul Intensif http://www.ceci-hpc.be Outline Introduction From serial source code to parallel execution MPI functions

More information

Introduction to MPI. SHARCNET MPI Lecture Series: Part I of II. Paul Preney, OCT, M.Sc., B.Ed., B.Sc.

Introduction to MPI. SHARCNET MPI Lecture Series: Part I of II. Paul Preney, OCT, M.Sc., B.Ed., B.Sc. Introduction to MPI SHARCNET MPI Lecture Series: Part I of II Paul Preney, OCT, M.Sc., B.Ed., B.Sc. preney@sharcnet.ca School of Computer Science University of Windsor Windsor, Ontario, Canada Copyright

More information

Parallel programming with MPI Part I -Introduction and Point-to-Point

Parallel programming with MPI Part I -Introduction and Point-to-Point Parallel programming with MPI Part I -Introduction and Point-to-Point Communications A. Emerson, Supercomputing Applications and Innovation (SCAI), CINECA 1 Contents Introduction to message passing and

More information

High Performance Computing

High Performance Computing High Performance Computing Course Notes 2009-2010 2010 Message Passing Programming II 1 Communications Point-to-point communications: involving exact two processes, one sender and one receiver For example,

More information

Introduction to MPI. Ritu Arora Texas Advanced Computing Center June 17,

Introduction to MPI. Ritu Arora Texas Advanced Computing Center June 17, Introduction to MPI Ritu Arora Texas Advanced Computing Center June 17, 2014 Email: rauta@tacc.utexas.edu 1 Course Objectives & Assumptions Objectives Teach basics of MPI-Programming Share information

More information

Distributed Memory Programming with MPI

Distributed Memory Programming with MPI Distributed Memory Programming with MPI Part 1 Bryan Mills, PhD Spring 2017 A distributed memory system A shared memory system Identifying MPI processes n Common pracace to idenafy processes by nonnegaave

More information

The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs

The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs 1 The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs http://mpi-forum.org https://www.open-mpi.org/ Mike Bailey mjb@cs.oregonstate.edu Oregon State University mpi.pptx

More information

Outline. Introduction to HPC computing. OpenMP MPI. Introduction. Understanding communications. Collective communications. Communicators.

Outline. Introduction to HPC computing. OpenMP MPI. Introduction. Understanding communications. Collective communications. Communicators. Lecture 8 MPI Outline Introduction to HPC computing OpenMP MPI Introduction Understanding communications Collective communications Communicators Topologies Grouping Data for Communication Input / output

More information

Reusing this material

Reusing this material Messages Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

CSE 160 Lecture 15. Message Passing

CSE 160 Lecture 15. Message Passing CSE 160 Lecture 15 Message Passing Announcements 2013 Scott B. Baden / CSE 160 / Fall 2013 2 Message passing Today s lecture The Message Passing Interface - MPI A first MPI Application The Trapezoidal

More information

What s in this talk? Quick Introduction. Programming in Parallel

What s in this talk? Quick Introduction. Programming in Parallel What s in this talk? Parallel programming methodologies - why MPI? Where can I use MPI? MPI in action Getting MPI to work at Warwick Examples MPI: Parallel Programming for Extreme Machines Si Hammond,

More information

Collective Communications

Collective Communications Collective Communications Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

MPI Program Structure

MPI Program Structure MPI Program Structure Handles MPI communicator MPI_COMM_WORLD Header files MPI function format Initializing MPI Communicator size Process rank Exiting MPI 1 Handles MPI controls its own internal data structures

More information