Introduction to Parallel Programming for Multicore/Manycore Clusters Part II-3: Parallel FVM using MPI


1 Introduction to Parallel Programming for Multicore/Manycore Clusters Part II-3: Parallel FVM using MPI Kengo Nakajima Information Technology Center The University of Tokyo

2 Overview Introduction Local Data Structure & Communication (1D, 2D)

3 Goal of the Last Part FVM Code with OpenMP/MPI Hybrid Parallel Programming Model, based on the Initial Code (L-sol on the first day), with Diagonal/Point-Jacobi Preconditioning (METHOD=). OpenMP: straightforward, no data dependency, just insert OpenMP directives. MPI: distributed computation, special data structure required. Flat MPI: each core runs an independent MPI process; intra/inter-node communication by MPI only. Hybrid: hierarchical structure, OpenMP within each node/socket and MPI between them. (Figure: memory/core diagrams of the flat-MPI and hybrid configurations.)

4 Some Technical Terms Processor, Core: processing unit (H/W); Processor = Core for single-core processors. Process: unit of MPI computation, nearly equal to a core. Each core (or processor) can host multiple processes (but this is not efficient). PE (Processing Element): PE originally means processor, but it is sometimes used as process in this class; moreover, it can mean domain (next). In multicore processors, PE generally means core. Domain: domain = process (= PE), each of the MD in SPMD, each data set. Process ID of MPI (ID of PE, ID of domain) starts from 0: if you have 8 processes (PEs, domains), IDs are 0~7.

5 5 Parallel Computing on Distributed Memory Architecture Faster, Larger & More Complicated. Scalability: solving an N-times larger problem using N times the computational resources in the same computation time, for large-scale problems: Weak Scaling (e.g. CG solver: more iterations are needed for larger problems). Solving a fixed-size problem using N times the computational resources in 1/N the computation time, for faster computation: Strong Scaling. (Figure: nodes with CPUs and multiple cores connected by a network.)
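As a brief aside (these formulas are not on the slide), the two notions of scalability can be written compactly; here T(P, W) denotes the elapsed time for total problem size W on P processes:

S_{\mathrm{strong}}(P) = \frac{T(1,\,W)}{T(P,\,W)} \quad (\text{ideal: } S = P), \qquad
E_{\mathrm{weak}}(P) = \frac{T(1,\,W)}{T(P,\,P\,W)} \quad (\text{ideal: } E = 1)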

6 6 What is Parallel Computing? (1/2) To solve larger problems faster. Homogeneous/Heterogeneous Porous Media (Lawrence Livermore National Laboratory). Very fine meshes are required for simulations of heterogeneous fields.

7 7 What is Parallel Computing? (2/2) A PC with GB of memory: M meshes are the limit for FEM. Southwest Japan (,000 km x ,000 km x 00 km) with a km mesh -> meshes. Large Data -> Domain Decomposition -> Local Operation; Inter-Domain Communication for Global Operation. (Figure: large-scale data partitioned into local data sets connected by communication.)

8 PE: Processing Element (Processor, Domain, Process) SPMD: mpirun -np M <Program>. You understand 90% of MPI if you understand this figure. PE #0, PE #1, PE #2, ..., PE #M-1 each run the same Program on Data #0, Data #1, Data #2, ..., Data #M-1. Each process does the same operation on different data; large-scale data is decomposed, and each part is computed by a separate process. Ideally, the parallel program differs from the serial one only in communication.

9 9 What is Communication? Parallel Computing -> Local Operations Communications are required in Global Operations for Consistency.

10 Local Data Structures for Parallel FVM/FDM using Krylov Iterative Solvers Example: 2D Mesh (5-point stencil)

11 Example: 2D FDM Mesh (5-point stencil): regions/domains

12 Example: 2D Mesh (5-point stencil): regions/domains

13 Example: 2D Mesh (5-point stencil): meshes at the domain boundary need info. from neighboring domains

14 Example: 2D Mesh (5-point stencil): meshes at the domain boundary need info. from neighboring domains

15 Example: 2D Mesh (5-point stencil): communications using HALO (overlapped meshes)

16 Coefficient matrices can be locally generated on each partition by this data structure

17 Operations in Parallel FVM SPMD: Single-Program Multiple-Data. Large-scale data -> partitioned into distributed local data sets. The FVM code assembles the coefficient matrix for each local data set: this part is completely local, the same as the serial operations. Global operations & communications happen only in the linear solvers: dot products, matrix-vector multiply, preconditioning. (Figure: Local Data -> FVM Matrix -> Linear Solver on each PE, with MPI connecting the linear solvers; a sketch of this structure follows below.)
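To make the SPMD structure above concrete, here is a minimal sketch (not the course code) of how such a parallel FVM program is typically organized; read_local_data, assemble_local_matrix, and solve_cg_parallel are hypothetical placeholder names for the steps described on this slide:

#include <mpi.h>
#include <stdio.h>

/* hypothetical placeholders for the steps named in the slide */
static void read_local_data(int rank)   { (void)rank; /* read sqm.<rank>, sq.<rank>, ... */ }
static void assemble_local_matrix(void) { /* purely local, same as the serial FVM code  */ }
static void solve_cg_parallel(void)     { /* dot products, mat-vec, halo exchange here  */ }

int main(int argc, char **argv)
{
  int my_rank, n_proc;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &n_proc);

  read_local_data(my_rank);    /* each process reads its own distributed local data set */
  assemble_local_matrix();     /* completely local operation                            */
  solve_cg_parallel();         /* global operations & communication live only here      */

  if (my_rank == 0) printf("ran with %d processes\n", n_proc);
  MPI_Finalize();
  return 0;
}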

18 PE: Processing Element (Processor, Domain, Process) SPMD: mpirun -np M <Program>. PE #0, PE #1, PE #2, ..., PE #M-1 each run the same Program on Data #0, Data #1, Data #2, ..., Data #M-1. Each process does the same operation on different data; large-scale data is decomposed, and each part is computed by a separate process. Ideally, the parallel program differs from the serial one only in communication.

19 Parallel FVM Procedures Design of the local data structure is important for the SPMD-type operations shown on the previous page: matrix generation, preconditioned iterative solvers for linear equations.

20 Example: D Mesh (5-point stencil) communications using HALO (overlapped meshes) 0

21 Internal / External / Boundary Nodes Internal Nodes/Meshes: Originally assigned to the process (domain)

22 Internal / External / Boundary Nodes External Nodes/Meshes: Originally assigned to other processes (domains), but referenced by the process: HALO

23 Internal / External / Boundary Nodes Boundary Nodes/Meshes: Internal nodes referred by other processes (domains) as external nodes

24 What is Communication? Getting information of external nodes from external partitions (local data). In this study, Generalized Communication Tables contain this information.

25 Introduction Quick Overview of MPI Local Data Structure & Communication (1D, 2D)

26 1D FVM: meshes / domains

27 1D FVM: meshes / domains

28 # Internal Nodes should be balanced

29 Matrices are incomplete! (figure: local matrices on the three domains, with entries missing at the domain boundaries)

30 Connected Cells + External Nodes (figure: local matrices completed by the columns for external nodes)

31 1D FVM: meshes / domains (figure: three domains with internal and external nodes)

32 1D FVM: meshes / domains

33 Local Numbering for SPMD Numbering of internal nodes is 1-N (0 to N-1 in C), so the same operations as in the serial program can be applied. How about the numbering of external nodes?

34 PE: Processing Element (Processor, Domain, Process) SPMD: mpirun -np M <Program>. PE #0, PE #1, PE #2, ..., PE #M-1 each run the same Program on Data #0, Data #1, Data #2, ..., Data #M-1. Each process does the same operation on different data; large-scale data is decomposed, and each part is computed by a separate process. Ideally, the parallel program differs from the serial one only in communication.

35 Local Numbering for SPMD Numbering of external nodes: N+1, N+2, ... (N, N+1, ... in C)

36 Preconditioned CG Solver
Compute r(0) = b - [A]x(0)
for i = 1, 2, ...
  solve [M]z(i-1) = r(i-1)
  rho(i-1) = r(i-1) . z(i-1)
  if i = 1
    p(1) = z(0)
  else
    beta(i-1) = rho(i-1) / rho(i-2)
    p(i) = z(i-1) + beta(i-1) p(i-1)
  endif
  q(i) = [A]p(i)
  alpha(i) = rho(i-1) / (p(i) . q(i))
  x(i) = x(i-1) + alpha(i) p(i)
  r(i) = r(i-1) - alpha(i) q(i)
  check convergence |r|
end
(Figure: [M] is the diagonal matrix diag(D_1, ..., D_N), i.e. diagonal/point-Jacobi preconditioning.)

37 7 Preconditioning, DAXPY Local Operations by Only Internal Points: Parallel Processing is possible 0 /* //-- {z}= [Minv]{r} */ for(i=0;i<n;i++){ W[Z][i] = W[DD][i] * W[R][i]; } /* //-- {x}= {x} + ALPHA*{p} DAXPY: double a{x} plus {y} // {r}= {r} - ALPHA*{q} */ for(i=0;i<n;i++){ U[i] += Alpha * W[P][i]; W[R][i] -= Alpha * W[Q][i]; }

38 Dot Products Global Summation needed: Communication? 0 /* //-- ALPHA= RHO / {p}{q} */ C = 0.0; for(i=0;i<n;i++){ C += W[P][i] * W[Q][i]; } Alpha = Rho / C;

39 MPI_Reduce (figure: values A, B, C, D on P#0-P#3 are reduced to op.A0-A3, op.B0-B3, op.C0-C3, op.D0-D3 on the root process) Reduces values on all processes to a single value: Summation, Product, Max, Min etc. MPI_Reduce (sendbuf,recvbuf,count,datatype,op,root,comm) sendbuf choice I starting address of send buffer recvbuf choice O starting address of receive buffer, type is defined by datatype count int I number of elements in send/receive buffer datatype MPI_Datatype I data type of elements of send/receive buffer FORTRAN: MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION, MPI_CHARACTER etc. C: MPI_INT, MPI_FLOAT, MPI_DOUBLE, MPI_CHAR etc. op MPI_Op I reduce operation: MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, MPI_LAND, MPI_BAND etc. Users can define operations by MPI_OP_CREATE root int I rank of root process comm MPI_Comm I communicator C

40 0 MPI_Bcast P#0 A0 B0 C0 D0 P# Broadcast P#0 A0 B0 C0 D0 P# A0 B0 C0 D0 P# P# A0 B0 C0 D0 P# P# A0 B0 C0 D0 Broadcasts a message from the process with rank "root" to all other processes of the communicator MPI_Bcast (buffer,count,datatype,root,comm) buffer choice I/O starting address of buffer type is defined by datatype count int I number of elements in send/recv buffer datatype MPI_Datatype I data type of elements of send/recv buffer FORTRAN MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION, MPI_CHARACTER etc. C MPI_INT, MPI_FLOAT, MPI_DOUBLE, MPI_CHAR etc. root int I rank of root process comm MPI_Comm I communicator C
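As a small illustration (not part of the lecture code), broadcasting a solver tolerance read on rank 0 to all other processes might look like this; the variable eps and its value are hypothetical examples:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int my_rank;
  double eps = 0.0;                 /* convergence tolerance, example value        */

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

  if (my_rank == 0) eps = 1.0e-8;   /* only the root knows the value initially     */

  /* after this call every process in the communicator holds eps = 1.0e-8          */
  MPI_Bcast(&eps, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  printf("rank %d: eps = %e\n", my_rank, eps);
  MPI_Finalize();
  return 0;
}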

41 MPI_Allreduce (figure: every process receives the reduced values op.A0-A3, op.B0-B3, op.C0-C3, op.D0-D3) MPI_Reduce + MPI_Bcast. Summation (of dot products) and MAX/MIN values are likely to be utilized in each process. MPI_Allreduce (sendbuf,recvbuf,count,datatype,op,comm) sendbuf choice I starting address of send buffer recvbuf choice O starting address of receive buffer, type is defined by datatype count int I number of elements in send/recv buffer datatype MPI_Datatype I data type of elements of send/recv buffer op MPI_Op I reduce operation comm MPI_Comm I communicator C
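For the dot product shown two slides back, a minimal sketch of the parallel version is the following fragment of the CG loop (it reuses the lecture's variable names W, N, Rho, Alpha, C; the local partial sum C0 is introduced only for this sketch):

/* local partial dot product over internal points only */
C0 = 0.0;
for (i = 0; i < N; i++) {
  C0 += W[P][i] * W[Q][i];
}

/* global summation over all processes: every process receives the same C */
MPI_Allreduce(&C0, &C, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

Alpha = Rho / C;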

42 op of MPI_Reduce/Allreduce C MPI_Reduce (sendbuf,recvbuf,count,datatype,op,root,comm) MPI_MAX,MPI_MIN Max, Min MPI_SUM,MPI_PROD Summation, Product MPI_LAND Logical AND

43 Matrix-Vector Products Values at External Points: P-to-P Communication /* //-- {q}= [A]{p} */ for(i=0;i<n;i++){ W[Q][i] = Diag[i] * W[P][i]; for(j=index[i];j<index[i+1];j++){ W[Q][i] += AMat[j]*W[P][Item[j]-1]; } }
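Before this loop can run, the values of {p} at external (halo) points must be up to date. A hedged sketch of the calling order, where update_external_points() is a hypothetical name standing for the Isend/Irecv/Waitall exchange described later in this part, is:

/* 1. refresh the halo: boundary values of {p} are copied to sending buffers,
      sent to the neighbors, and the received values are stored at the
      external points (local IDs N .. NP-1)                                    */
update_external_points(W[P]);   /* hypothetical name for the exchange routine  */

/* 2. purely local matrix-vector product, identical to the serial loop         */
for (i = 0; i < n; i++) {
  W[Q][i] = Diag[i] * W[P][i];
  for (j = index[i]; j < index[i+1]; j++) {
    W[Q][i] += AMat[j] * W[P][Item[j]-1];
  }
}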

44 Mat-Vec Products: Local Op. Possible (matrix-vector diagram)

45 Mat-Vec Products: Local Op. Possible (matrix-vector diagram, partitioned by domain)

46 Mat-Vec Products: Local Op. Possible (matrix-vector diagram, partitioned by domain)

47 Mat-Vec Products: Local Op. (matrix-vector diagram for one domain)

48 Mat-Vec Products: Local Op. (matrix-vector diagram for one domain)

49 Mat-Vec Products: Local Op. (matrix-vector diagram for one domain)

50 1D FVM: meshes / domains

51 1D FVM: meshes / domains. Local ID: starting from 0 for the meshes in each domain (#0, #1, #2).

52 1D FVM: meshes / domains. Internal/External Nodes on domains #0, #1, #2.

53 Collective/Point-to-Point Communication Collective Communication: MPI_Reduce, MPI_Scatter/Gather etc.; communications with all processes in the communicator. Application area: BEM, spectral method, MD, where global interactions are considered; dot products, MAX/MIN: global summation & comparison. Point-to-Point Communication: MPI_Send, MPI_Recv; communication with a limited number of processes (neighbors). Application area: FEM, FDM (localized methods).

54 SEND: sending from boundary nodes Send continuous data to the send buffer of each neighbor MPI_Isend (sendbuf,count,datatype,dest,tag,comm,request) sendbuf choice I starting address of sending buffer count int I number of elements sent to each process datatype MPI_Datatype I data type of elements of sending buffer dest int I rank of destination

55 MPI_Isend C 55 Begins a non-blocking send Send the contents of sending buffer (starting from sendbuf, number of messages: count) to dest with tag. Contents of sending buffer cannot be modified before calling corresponding MPI_Waitall. MPI_Isend (sendbuf,count,datatype,dest,tag,comm,request) sendbuf choice I starting address of sending buffer count int I number of elements in sending buffer datatype MPI_Datatype I datatype of each sending buffer element dest int I rank of destination tag int I message tag This integer can be used by the application to distinguish messages. Communication occurs if tag s of MPI_Isend and MPI_Irecv are matched. Usually tag is set to be 0 (in this class), comm MPI_Comm I communicator request MPI_Request O communication request array used in MPI_Waitall

56 RECV: receiving to external nodes Receive continuous data into the receive buffer from each neighbor MPI_Irecv (recvbuf,count,datatype,source,tag,comm,request) recvbuf choice I starting address of receiving buffer count int I number of elements in receiving buffer datatype MPI_Datatype I data type of elements of receiving buffer source int I rank of source

57 MPI_Irecv C Begins a non-blocking receive Receiving the contents of receiving buffer (starting from recvbuf, number of messages: count) from source with tag. Contents of receiving buffer cannot be used before calling corresponding MPI_Waitall. MPI_Irecv (recvbuf,count,datatype,source,tag,comm,request) recvbuf choice I starting address of receiving buffer count int I number of elements in receiving buffer datatype MPI_Datatype I datatype of each receiving buffer element source int I rank of source tag int I message tag This integer can be used by the application to distinguish messages. Communication occurs if tag s of MPI_Isend and MPI_Irecv are matched. Usually tag is set to be 0 (in this class), comm MPI_Comm I communicator request MPI_Request O communication request array used in MPI_Waitall 57

58 MPI_Waitall C MPI_Waitall blocks until all communications associated with the requests in the array complete. It is used for synchronizing MPI_Isend and MPI_Irecv in this class. In the sending phase, the contents of the sending buffer cannot be modified before calling the corresponding MPI_Waitall. In the receiving phase, the contents of the receiving buffer cannot be used before calling the corresponding MPI_Waitall. MPI_Isend and MPI_Irecv can be synchronized simultaneously with a single MPI_Waitall if it is consistent; the same request array should then be used for MPI_Isend and MPI_Irecv. Its operation is similar to that of MPI_Barrier, but MPI_Waitall cannot be replaced by MPI_Barrier. Possible troubles when using MPI_Barrier instead of MPI_Waitall: contents of request and status are not updated properly, very slow operations, etc. MPI_Waitall (count,request,status) count int I number of requests to be synchronized request MPI_Request I/O comm. requests used in MPI_Waitall (array size: count) status MPI_Status O array of status objects MPI_STATUS_SIZE: defined in mpif.h, mpi.h
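A minimal self-contained illustration of the Isend/Irecv/Waitall pattern between two processes (not the lecture code, which uses the generalized communication tables shown later) could look like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int my_rank, other, val_send, val_recv;
  MPI_Request req[2];
  MPI_Status  stat[2];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);   /* run with exactly 2 processes */
  other    = 1 - my_rank;                    /* the other rank               */
  val_send = 100 + my_rank;                  /* example payload              */

  MPI_Isend(&val_send, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &req[0]);
  MPI_Irecv(&val_recv, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &req[1]);

  /* neither buffer may be touched until both requests have completed */
  MPI_Waitall(2, req, stat);

  printf("rank %d received %d\n", my_rank, val_recv);
  MPI_Finalize();
  return 0;
}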

59 Distributed Local Data Structure for Parallel Computation A distributed local data structure for domain-to-domain communication has been introduced, which is appropriate for applications with sparse coefficient matrices (e.g. FDM, FEM, FVM). SPMD Local numbering: internal pts first, then external pts Generalized communication table Everything is easy if the proper data structure is defined: values at boundary pts are copied into sending buffers, Send/Recv, values at external pts are updated through receiving buffers.

60 Introduction Quick Overview of MPI Local Data Structure & Communication (1D, 2D)

61 2D FDM (1/5) Entire Mesh

62 2D FDM (5-point, central difference): (φ_W − 2φ_C + φ_E)/Δx² + (φ_S − 2φ_C + φ_N)/Δy² = f_C

63 Decompose into 4 domains

64 4 domains: Global ID (figure: PE#0, PE#1, PE#2, PE#3)

65 4 domains: Local ID (figure: PE#0, PE#1, PE#2, PE#3)

66 External Points: Overlapped Region (figure: 5-point stencil N/S/E/W/C on the four domains)

67 External Points: Overlapped Region (figure: PE#0, PE#1, PE#2, PE#3)

68 Local ID of External Points? (figure: the four domains with external points marked "?")

69 Overlapped Region (figure: the four domains with external points marked "?")

70 Overlapped Region (figure: the four domains with external points marked "?")

71 Problem Setting: 2D FDM 2D region with 64 meshes (8x8). Each mesh has a global ID from 1 to 64. In this example, this global ID is considered as the dependent variable, such as temperature, pressure etc.: something like computed results.

72 Problem Setting: Distributed Local Data 4 sub-domains. Info. of external points (global ID of mesh) is received from neighbors. (Figure: PE#0 receives the halo values from its neighbors.)

73 Operations of 2D FDM: (φ_W − 2φ_C + φ_E)/Δx² + (φ_S − 2φ_C + φ_N)/Δy² = f_C

74 Operations of 2D FDM: (φ_W − 2φ_C + φ_E)/Δx² + (φ_S − 2φ_C + φ_N)/Δy² = f_C (5-point stencil: N, S, E, W around C)

75 Computation (/) On each PE, info. of internal pts (i=1-N(=16)) is read from the distributed local data, info. of boundary pts is sent to neighbors, and it is received as info. of external pts.

76 Computation (/): Before Send/Recv (table: local ID : value on each of the four PEs; internal points 1-16 hold their global IDs, external points 17-24 are still "?")

77 Computation (/): Before Send/Recv (same table, repeated)

78 Computation (/): After Send/Recv (table: local ID : value on each of the four PEs; external points 17-24 have now been filled by communication)

79 Overview of Distributed Local Data: Example on PE#0 (figures: value at each mesh (= global ID) and local ID)

80 SPMD PE #0, PE #1, PE #2, PE #3 each run a.out. Distributed local data sets sqm.0, sqm.1, sqm.2, sqm.3: neighbors, comm. tables (geometry). Distributed local data sets sq.0, sq.1, sq.2, sq.3: global ID of internal points (results).

81 2D FDM: PE#0, Information at each domain (1/4) Internal Nodes/Points/Meshes: meshes originally assigned to the domain.

82 2D FDM: PE#0, Information at each domain (2/4) Internal Nodes/Points/Meshes: meshes originally assigned to the domain. External Nodes/Points/Meshes: meshes originally assigned to a different domain, but required for the computation of meshes in this domain (meshes in the overlapped regions); also called sleeves or halo.

83 2D FDM: PE#0, Information at each domain (3/4) Internal Nodes/Points/Meshes: meshes originally assigned to the domain. External Nodes/Points/Meshes: meshes originally assigned to a different domain, but required for the computation of meshes in this domain (meshes in the overlapped regions). Boundary Nodes/Points/Meshes: internal points which are also external points of other domains (used in the computation of meshes in other domains).

84 2D FDM: PE#0, Information at each domain (4/4) Internal Nodes/Points/Meshes: meshes originally assigned to the domain. External Nodes/Points/Meshes: meshes originally assigned to a different domain, but required for the computation of meshes in this domain (meshes in the overlapped regions). Boundary Nodes/Points/Meshes: internal points which are also external points of other domains (used in the computation of meshes in other domains). Relationships between Domains: communication table, external/boundary points, neighbors.

85 5 Description of Distributed Local Data Internal/External Nodes Numbering: Starting from internal nodes, then external nodes after that Neighbors Shares overlapped meshes Number and ID of neighbors Import Table (Receive) From where, how many, and which external nodes are received/imported? Export Table (Send) To where, how many and which boundary nodes are sent/exported?
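Collecting the items above into one place, a C struct along the following lines could hold the distributed local data; this is only an illustrative sketch using the array names that appear in the lecture code, not a definition taken from it:

typedef struct {
  /* internal/external nodes: 0..N-1 internal, N..NP-1 external    */
  int N;               /* number of internal nodes                 */
  int NP;              /* total number of nodes (internal + ext.)  */

  /* neighbors sharing overlapped meshes                           */
  int  NeibPETot;      /* number of neighbors                      */
  int *NeibPE;         /* ranks of the neighbors [NeibPETot]       */

  /* import table (receive): which external nodes, from where      */
  int *ImportIndex;    /* size NeibPETot+1, offsets per neighbor   */
  int *ImportItem;     /* local IDs of external nodes              */

  /* export table (send): which boundary nodes, to where           */
  int *ExportIndex;    /* size NeibPETot+1, offsets per neighbor   */
  int *ExportItem;     /* local IDs of boundary nodes              */
} LocalMesh;           /* hypothetical name, used only in this sketch */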

86 Overview of Distributed Local Data: Example on PE#0 (figures: value at each mesh (= global ID) and local ID)

87 Generalized Communication Table: Send Neighbors: NeibPETot, NeibPE[neib] Message size for each neighbor: export_index[neib], neib= 0, NeibPETot-1 ID of boundary nodes: export_item[k], k= 0, export_index[NeibPETot]-1 Messages to each neighbor: SendBuf[k], k= 0, export_index[NeibPETot]-1 C

88 SEND: MPI_Isend/Irecv/Waitall C
SendBuf layout: neib#0 | neib#1 | neib#2 | neib#3, each block of size BUFlength_e, delimited by export_index[0], export_index[1], export_index[2], export_index[3], export_index[4].
export_item (export_index[neib] : export_index[neib+1]-1) are sent to the neib-th neighbor.

/* boundary-node values are copied to the sending buffer */
for (neib=0; neib<NeibPETot; neib++){
  for (k=export_index[neib]; k<export_index[neib+1]; k++){
    kk= export_item[k];
    SendBuf[k]= VAL[kk];
  }
}

for (neib=0; neib<NeibPETot; neib++){
  tag= 0;
  iS_e= export_index[neib];
  iE_e= export_index[neib+1];
  BUFlength_e= iE_e - iS_e;

  ierr= MPI_Isend (&SendBuf[iS_e], BUFlength_e, MPI_DOUBLE,
                   NeibPE[neib], 0, MPI_COMM_WORLD, &ReqSend[neib]);
}
MPI_Waitall(NeibPETot, ReqSend, StatSend);

89 Generalized Communication Table: Receive Neighbors: NeibPETot, NeibPE[neib] Message size for each neighbor: import_index[neib], neib= 0, NeibPETot-1 ID of external nodes: import_item[k], k= 0, import_index[NeibPETot]-1 Messages from each neighbor: RecvBuf[k], k= 0, import_index[NeibPETot]-1 C

90 RECV: MPI_Isend/Irecv/Waitall C

for (neib=0; neib<NeibPETot; neib++){
  tag= 0;
  iS_i= import_index[neib];
  iE_i= import_index[neib+1];
  BUFlength_i= iE_i - iS_i;

  ierr= MPI_Irecv (&RecvBuf[iS_i], BUFlength_i, MPI_DOUBLE,
                   NeibPE[neib], 0, MPI_COMM_WORLD, &ReqRecv[neib]);
}
MPI_Waitall(NeibPETot, ReqRecv, StatRecv);

/* received values are copied from the receiving buffer to the external points */
for (neib=0; neib<NeibPETot; neib++){
  for (k=import_index[neib]; k<import_index[neib+1]; k++){
    kk= import_item[k];
    VAL[kk]= RecvBuf[k];
  }
}

import_item (import_index[neib] : import_index[neib+1]-1) are received from the neib-th neighbor.
RecvBuf layout: neib#0 | neib#1 | neib#2 | neib#3, each block of size BUFlength_i, delimited by import_index[0], import_index[1], import_index[2], import_index[3], import_index[4].

91 Relationship SEND/RECV

do neib= 1, NEIBPETOT
  iS_e= export_index(neib-1) + 1
  iE_e= export_index(neib  )
  BUFlength_e= iE_e + 1 - iS_e
  call MPI_ISEND &
&      (SENDbuf(iS_e), BUFlength_e, MPI_INTEGER, NEIBPE(neib), 0,&
&       MPI_COMM_WORLD, request_send(neib), ierr)
enddo

do neib= 1, NEIBPETOT
  iS_i= import_index(neib-1) + 1
  iE_i= import_index(neib  )
  BUFlength_i= iE_i + 1 - iS_i
  call MPI_IRECV &
&      (RECVbuf(iS_i), BUFlength_i, MPI_INTEGER, NEIBPE(neib), 0,&
&       MPI_COMM_WORLD, request_recv(neib), ierr)
enddo

Consistency of IDs of sources/destinations, size and contents of messages! Communication occurs when NEIBPE(neib) matches.

92 9 Relationship SEND/RECV (#0 to #) # # # Send #0 Recv. # #5 #0 #9 NEIBPE(:)=,,5,9 #0 NEIBPE(:)=,0,0 Consistency of ID s of sources/destinations, size and contents of messages! Communication occurs when NEIBPE(neib) matches

93 9 Generalized Comm. Table (/6) PE# PE# #NEIBPEtot #NEIBPE #NODE 6 #IMPORT_index #IMPORT_items #EXPORT_index #EXPORT_items 6 5 6

94 9 Generalized Comm. Table (/6) PE# PE# #NEIBPEtot Number of neighbors #NEIBPE ID of neighbors #NODE 6 Ext/Int Pts, Int Pts #IMPORT_index #IMPORT_items #EXPORT_index #EXPORT_items 6 5 6

95 Generalized Comm. Table (3/6) #NEIBPEtot #NEIBPE #NODE 6 #IMPORT_index #IMPORT_items #EXPORT_index #EXPORT_items Four ext pts (1st-4th items) are imported from the 1st neighbor (PE#), and four (5th-8th items) are from the 2nd neighbor (PE#).

96 Generalized Comm. Table (4/6) #NEIBPEtot #NEIBPE #NODE 6 #IMPORT_index #IMPORT_items #EXPORT_index #EXPORT_items imported from the 1st Neighbor (PE#) (1st-4th items); imported from the 2nd Neighbor (PE#) (5th-8th items)

97 Generalized Comm. Table (5/6) #NEIBPEtot #NEIBPE #NODE 6 #IMPORT_index #IMPORT_items Four boundary pts (1st-4th items) are exported to the 1st neighbor (PE#), and four (5th-8th items) are to the 2nd neighbor (PE#). #EXPORT_index #EXPORT_items 6 5 6

98 Generalized Comm. Table (6/6) #NEIBPEtot #NEIBPE #NODE 6 #IMPORT_index #IMPORT_items #EXPORT_index #EXPORT_items exported to the 1st Neighbor (PE#) (1st-4th items); exported to the 2nd Neighbor (PE#) (5th-8th items)

99 Generalized Comm. Table (6/6) An external point is only sent from its original domain. A boundary point can be referenced by more than one domain, and sent to multiple domains (e.g. the 6th mesh).

100 Notice: Send/Recv Arrays #PE0 send: VEC(start_send)~VEC(start_send+length_send-1) #PE1 send: VEC(start_send)~VEC(start_send+length_send-1) #PE0 recv: VEC(start_recv)~VEC(start_recv+length_recv-1) #PE1 recv: VEC(start_recv)~VEC(start_recv+length_recv-1) length_send of the sending process must be equal to length_recv of the receiving process (PE#0 to PE#1, PE#1 to PE#0). sendbuf and recvbuf: different addresses.

101 Copying files / 2D FDM on Oakleaf-FX >$ cd >$ cp /home/z00/omp/hybrid-c.tar . >$ cp /home/z00/omp/hybrid-f.tar . >$ tar xvf hybrid-c.tar (or hybrid-f.tar) >$ cd hybrid >$ ls S fvm (<$O-S>, <$O-fvm>) $ cd <$O-S> $ mpifrtpx -Kfast sq-sr.f $ mpifccpx -Kfast sq-sr.c (modify go.sh for the number of processes) $ pjsub go.sh

102 Job Script for FX10: go.sh <$O-S>/go.sh Scheduling + Shell Script #!/bin/sh #PJM -L node= (Number of Nodes) #PJM -L elapse=00:0:00 (Computation Time) #PJM -L rscgrp=lecture7 (Name of QUEUE) #PJM -g gt7 (Group Name, Wallet) #PJM -j #PJM -o test.lst (Standard Output) #PJM --mpi proc= (Number of MPI Processes) mpiexec ./a.out (table: number of nodes vs. number of MPI processes for various process counts)

103 Example: sq-sr.c (/6) Initialization #include <stdio.h> #include <stdlib.h> #include <string.h> #include <assert.h> #include "mpi.h" int main(int argc, char **argv){ C 0 int n, np, NeibPeTot, BufLength; MPI_Status *StatSend, *StatRecv; MPI_Request *RequestSend, *RequestRecv; int MyRank, PeTot; int *val, *SendBuf, *RecvBuf, *NeibPe; int *ImportIndex, *ExportIndex, *ImportItem, *ExportItem; char FileName[0], line[0]; int i, nn, neib; int istart, iend; FILE *fp; /*!C !C INIT. MPI!C !C===*/ MPI_Init(&argc, &argv); MPI_Comm_size(MPI_COMM_WORLD, &PeTot); MPI_Comm_rank(MPI_COMM_WORLD, &MyRank);

104 Example: sq-sr.c (2/6) Reading distributed local data files (sqm.*) C

/*!C
!C-- DATA file
!C===*/
sprintf(FileName, "sqm.%d", MyRank);
fp = fopen(FileName, "r");

fscanf(fp, "%d", &NeibPeTot);
NeibPe      = calloc(NeibPeTot,   sizeof(int));
ImportIndex = calloc(1+NeibPeTot, sizeof(int));
ExportIndex = calloc(1+NeibPeTot, sizeof(int));

for(neib=0;neib<NeibPeTot;neib++){
  fscanf(fp, "%d", &NeibPe[neib]); }

fscanf(fp, "%d %d", &np, &n);

for(neib=1;neib<NeibPeTot+1;neib++){
  fscanf(fp, "%d", &ImportIndex[neib]);}
nn = ImportIndex[NeibPeTot];
ImportItem = malloc(nn * sizeof(int));
for(i=0;i<nn;i++){
  fscanf(fp, "%d", &ImportItem[i]); ImportItem[i]--;}

for(neib=1;neib<NeibPeTot+1;neib++){
  fscanf(fp, "%d", &ExportIndex[neib]);}
nn = ExportIndex[NeibPeTot];
ExportItem = malloc(nn * sizeof(int));
for(i=0;i<nn;i++){
  fscanf(fp, "%d", &ExportItem[i]); ExportItem[i]--;}

105 Example: sq-sr.c (/6) Reading distributed local data files (sqm.*) /*!C !C DATA file!c !C===*/ sprintf(filename, "sqm.%d", MyRank); fp = fopen(filename, "r"); fscanf(fp, "%d", &NeibPeTot); NeibPe = calloc(neibpetot, sizeof(int)); ImportIndex = calloc(+neibpetot, sizeof(int)); ExportIndex = calloc(+neibpetot, sizeof(int)); for(neib=0;neib<neibpetot;neib++){ fscanf(fp, "%d", &NeibPe[neib]); } fscanf(fp, "%d %d", &np, &n); for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ImportIndex[neib]);} nn = ImportIndex[NeibPeTot]; ImportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ImportItem[i]); ImportItem[i]--;} for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ExportIndex[neib]);} nn = ExportIndex[NeibPeTot]; ExportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ExportItem[i]);ExportItem[i]--;} #NEIBPEtot #NEIBPE #NODE 6 #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems C 05

106 Example: sq-sr.c (/6) Reading distributed local data files (sqm.*) /*!C !C DATA file!c !C===*/ sprintf(filename, "sqm.%d", MyRank); fp = fopen(filename, "r"); fscanf(fp, "%d", &NeibPeTot); NeibPe = calloc(neibpetot, sizeof(int)); ImportIndex = calloc(+neibpetot, sizeof(int)); ExportIndex np Number = of calloc(+neibpetot, all meshes (internal sizeof(int)); + external) n Number of internal meshes for(neib=0;neib<neibpetot;neib++){ fscanf(fp, "%d", &NeibPe[neib]); } fscanf(fp, "%d %d", &np, &n); for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ImportIndex[neib]);} nn = ImportIndex[NeibPeTot]; ImportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ImportItem[i]); ImportItem[i]--;} for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ExportIndex[neib]);} nn = ExportIndex[NeibPeTot]; ExportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ExportItem[i]);ExportItem[i]--;} #NEIBPEtot #NEIBPE #NODE 6 #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems C 06

107 Example: sq-sr.c (/6) Reading distributed local data files (sqm.*) /*!C !C DATA file!c !C===*/ sprintf(filename, "sqm.%d", MyRank); fp = fopen(filename, "r"); fscanf(fp, "%d", &NeibPeTot); NeibPe = calloc(neibpetot, sizeof(int)); ImportIndex = calloc(+neibpetot, sizeof(int)); ExportIndex = calloc(+neibpetot, sizeof(int)); for(neib=0;neib<neibpetot;neib++){ fscanf(fp, "%d", &NeibPe[neib]); } fscanf(fp, "%d %d", &np, &n); for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ImportIndex[neib]);} nn = ImportIndex[NeibPeTot]; ImportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ImportItem[i]); ImportItem[i]--;} for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ExportIndex[neib]);} nn = ExportIndex[NeibPeTot]; ExportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ExportItem[i]);ExportItem[i]--;} #NEIBPEtot #NEIBPE #NODE 6 #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems C 07

108 Example: sq-sr.c (/6) Reading distributed local data files (sqm.*) /*!C !C DATA file!c !C===*/ sprintf(filename, "sqm.%d", MyRank); fp = fopen(filename, "r"); fscanf(fp, "%d", &NeibPeTot); NeibPe = calloc(neibpetot, sizeof(int)); ImportIndex = calloc(+neibpetot, sizeof(int)); ExportIndex = calloc(+neibpetot, sizeof(int)); for(neib=0;neib<neibpetot;neib++){ fscanf(fp, "%d", &NeibPe[neib]); } fscanf(fp, "%d %d", &np, &n); for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ImportIndex[neib]);} nn = ImportIndex[NeibPeTot]; ImportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ImportItem[i]); ImportItem[i]--;} for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ExportIndex[neib]);} nn = ExportIndex[NeibPeTot]; ExportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ExportItem[i]);ExportItem[i]--;} #NEIBPEtot #NEIBPE #NODE 6 #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems C 0

109 09 RECV/Import: PE#0 #NEIBPEtot #NEIBPE #NODE 6 #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems PE#0 PE# PE#

110 Example: sq-sr.c (/6) Reading distributed local data files (sqm.*) /*!C !C DATA file!c !C===*/ sprintf(filename, "sqm.%d", MyRank); fp = fopen(filename, "r"); fscanf(fp, "%d", &NeibPeTot); NeibPe = calloc(neibpetot, sizeof(int)); ImportIndex = calloc(+neibpetot, sizeof(int)); ExportIndex = calloc(+neibpetot, sizeof(int)); for(neib=0;neib<neibpetot;neib++){ fscanf(fp, "%d", &NeibPe[neib]); } fscanf(fp, "%d %d", &np, &n); for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ImportIndex[neib]);} nn = ImportIndex[NeibPeTot]; ImportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ImportItem[i]); ImportItem[i]--;} for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ExportIndex[neib]);} nn = ExportIndex[NeibPeTot]; ExportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ExportItem[i]);ExportItem[i]--;} #NEIBPEtot #NEIBPE #NODE 6 #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems C 0

111 Example: sq-sr.c (/6) Reading distributed local data files (sqm.*) /*!C !C DATA file!c !C===*/ sprintf(filename, "sqm.%d", MyRank); fp = fopen(filename, "r"); fscanf(fp, "%d", &NeibPeTot); NeibPe = calloc(neibpetot, sizeof(int)); ImportIndex = calloc(+neibpetot, sizeof(int)); ExportIndex = calloc(+neibpetot, sizeof(int)); for(neib=0;neib<neibpetot;neib++){ fscanf(fp, "%d", &NeibPe[neib]); } fscanf(fp, "%d %d", &np, &n); for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ImportIndex[neib]);} nn = ImportIndex[NeibPeTot]; ImportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ImportItem[i]); ImportItem[i]--;} for(neib=;neib<neibpetot+;neib++){ fscanf(fp, "%d", &ExportIndex[neib]);} nn = ExportIndex[NeibPeTot]; ExportItem = malloc(nn * sizeof(int)); for(i=0;i<nn;i++){ fscanf(fp, "%d", &ExportItem[i]);ExportItem[i]--;} #NEIBPEtot #NEIBPE #NODE 6 #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems C

112 SEND/Export: PE#0 #NEIBPEtot #NEIBPE #NODE 6 #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems PE#0 PE# PE#

113 Example: sq-sr.c (3/6) Reading distributed local data files (sq.*) C sprintf(FileName, "sq.%d", MyRank); fp = fopen(FileName, "r"); assert(fp != NULL); val = calloc(np, sizeof(*val)); for(i=0;i<n;i++){ fscanf(fp, "%d", &val[i]); } n: number of internal points; val: global ID of meshes; val at external points is unknown at this stage.

114 Example: sq-sr.c (4/6) Preparation of sending/receiving buffers C

/*!C
!C-- BUFFER
!C===*/
SendBuf = calloc(ExportIndex[NeibPeTot], sizeof(*SendBuf));
RecvBuf = calloc(ImportIndex[NeibPeTot], sizeof(*RecvBuf));

for(neib=0;neib<NeibPeTot;neib++){
  istart = ExportIndex[neib];
  iend   = ExportIndex[neib+1];
  for(i=istart;i<iend;i++){
    SendBuf[i] = val[ExportItem[i]];
  }
}

Info. of boundary points is written into the sending buffer (SendBuf). Info. sent to NeibPe[neib] is stored in SendBuf[ExportIndex[neib] : ExportIndex[neib+1]-1].

115 Sending Buffer is nice... C for (neib=0; neib<NeibPETot; neib++){ tag= 0; iS_e= export_index[neib]; iE_e= export_index[neib+1]; BUFlength_e= iE_e - iS_e; ierr= MPI_Isend (&SendBuf[iS_e], BUFlength_e, MPI_DOUBLE, NeibPE[neib], 0, MPI_COMM_WORLD, &ReqSend[neib]); } The numbering of the boundary nodes themselves is not continuous, therefore the above procedure of MPI_Isend (starting address of a buffer, XX messages from that address) cannot be applied to them directly; copying them into a contiguous sending buffer solves this.

116 6 Communication Pattern using D Structure halo halo halo halo Dr. Osni Marques (Lawrence Berkeley National Laboratory)

117 Example: sq-sr.c (5/6) SEND/Export: MPI_Isend C

/*!C
!C-- SEND-RECV
!C===*/
StatSend    = malloc(sizeof(MPI_Status)  * NeibPeTot);
StatRecv    = malloc(sizeof(MPI_Status)  * NeibPeTot);
RequestSend = malloc(sizeof(MPI_Request) * NeibPeTot);
RequestRecv = malloc(sizeof(MPI_Request) * NeibPeTot);

for(neib=0;neib<NeibPeTot;neib++){
  istart = ExportIndex[neib];
  iend   = ExportIndex[neib+1];
  BufLength = iend - istart;

  MPI_Isend(&SendBuf[istart], BufLength, MPI_INT,
            NeibPe[neib], 0, MPI_COMM_WORLD, &RequestSend[neib]);
}

for(neib=0;neib<NeibPeTot;neib++){
  istart = ImportIndex[neib];
  iend   = ImportIndex[neib+1];
  BufLength = iend - istart;

  MPI_Irecv(&RecvBuf[istart], BufLength, MPI_INT,
            NeibPe[neib], 0, MPI_COMM_WORLD, &RequestRecv[neib]);
}

118 SEND/Export: PE#0 #NEIBPEtot #NEIBPE #NODE 6 #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems PE#0 PE# PE#

119 SEND: MPI_Isend/Irecv/Waitall C 9 SendBuf neib#0 neib# neib# neib# BUFlength_e BUFlength_e BUFlength_e BUFlength_e export_index[0] export_index[] export_index[] export_index[] export_index[] export_item (export_index[neib]:export_index[neib+]-) are sent to neib-th neighbor for (neib=0; neib<neibpetot;neib++){ for (k=export_index[neib];k<export_index[neib+];k++){ kk= export_item[k]; SendBuf[k]= VAL[kk]; } } for (neib=0; neib<neibpetot; neib++){ tag= 0; is_e= export_index[neib]; ie_e= export_index[neib+]; BUFlength_e= ie_e - is_e Copied to sending buffers } ierr= MPI_Isend (&SendBuf[iS_e], BUFlength_e, MPI_DOUBLE, NeibPE[neib], 0, MPI_COMM_WORLD, &ReqSend[neib]) MPI_Waitall(NeibPETot, ReqSend, StatSend);

120 MPI_Waitall C MPI_Waitall blocks until all comm s, associated with request in the array, complete. It is used for synchronizing MPI_Isend and MPI_Irecv in this class. At sending phase, contents of sending buffer cannot be modified before calling corresponding MPI_Waitall. At receiving phase, contents of receiving buffer cannot be used before calling corresponding MPI_Waitall. MPI_Isend and MPI_Irecv can be synchronized simultaneously with a single MPI_Waitall if it is consitent. Same request should be used in MPI_Isend and MPI_Irecv. Its operation is similar to that of MPI_Barrier but, MPI_Waitall can not be replaced by MPI_Barrier. Possible troubles using MPI_Barrier instead of MPI_Waitall: Contents of request and status are not updated properly, very slow operations etc. MPI_Waitall (count,request,status) count int I number of processes to be synchronized request MPI_Request I/O comm. request used in MPI_Waitall (array size: count) status MPI_Status O array of status objects MPI_STATUS_SIZE: defined in mpif.h, mpi.h 0

121 Notice: Send/Recv Arrays #PE0 send: VEC(start_send)~VEC(start_send+length_send-1) #PE1 send: VEC(start_send)~VEC(start_send+length_send-1) #PE0 recv: VEC(start_recv)~VEC(start_recv+length_recv-1) #PE1 recv: VEC(start_recv)~VEC(start_recv+length_recv-1) length_send of the sending process must be equal to length_recv of the receiving process (PE#0 to PE#1, PE#1 to PE#0). sendbuf and recvbuf: different addresses.

122 Relationship SEND/RECV do neib=, NEIBPETOT is_e= export_index(neib-) + ie_e= export_index(neib ) BUFlength_e= ie_e + - is_e call MPI_ISEND & & (SENDbuf(iS_e), BUFlength_e, MPI_INTEGER, NEIBPE(neib), 0,& & MPI_COMM_WORLD, request_send(neib), ierr) enddo do neib=, NEIBPETOT is_i= import_index(neib-) + ie_i= import_index(neib ) BUFlength_i= ie_i + - is_i call MPI_IRECV & & (RECVbuf(iS_i), BUFlength_i, MPI_INTEGER, NEIBPE(neib), 0,& & MPI_COMM_WORLD, request_recv(neib), ierr) enddo Consistency of ID s of sources/destinations, size and contents of messages! Communication occurs when NEIBPE(neib) matches

123 Relationship SEND/RECV (#0 to #) # # # Send #0 Recv. # #5 #0 #9 NEIBPE(:)=,,5,9 #0 NEIBPE(:)=,0,0 Consistency of ID s of sources/destinations, size and contents of messages! Communication occurs when NEIBPE(neib) matches

124 Example: sq-sr.c (5/6) RECV/Import: MPI_Irecv C /*!C!C !C SEND-RECV!C !C===*/ StatSend = malloc(sizeof(mpi_status) * NeibPeTot); StatRecv = malloc(sizeof(mpi_status) * NeibPeTot); RequestSend = malloc(sizeof(mpi_request) * NeibPeTot); RequestRecv = malloc(sizeof(mpi_request) * NeibPeTot); for(neib=0;neib<neibpetot;neib++){ istart = ExportIndex[neib]; iend = ExportIndex[neib+]; BufLength = iend - istart; MPI_Isend(&SendBuf[iStart], BufLength, MPI_INT, PE#0 PE# NeibPe[neib], 0, MPI_COMM_WORLD, &RequestSend[neib]); } for(neib=0;neib<neibpetot;neib++){ istart = ImportIndex[neib]; iend = ImportIndex[neib+]; BufLength = iend - istart; PE# PE# } MPI_Irecv(&RecvBuf[iStart], BufLength, MPI_INT, NeibPe[neib], 0, MPI_COMM_WORLD, &RequestRecv[neib]);

125 5 RECV/Import: PE#0 #NEIBPEtot #NEIBPE #NODE 6 #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems PE#0 PE# PE#

126 RECV: MPI_Isend/Irecv/Waitall for (neib=0; neib<neibpetot; neib++){ tag= 0; is_i= import_index[neib]; ie_i= import_index[neib+]; BUFlength_i= ie_i - is_i C 6 } ierr= MPI_Irecv (&RecvBuf[iS_i], BUFlength_i, MPI_DOUBLE, NeibPE[neib], 0, MPI_COMM_WORLD, &ReqRecv[neib]) MPI_Waitall(NeibPETot, ReqRecv, StatRecv); for (neib=0; neib<neibpetot;neib++){ for (k=import_index[neib];k<import_index[neib+];k++){ kk= import_item[k]; VAL[kk]= RecvBuf[k]; } } Copied from receiving buffer import_item (import_index[neib]:import_index[neib+]-) are received from neib-th neighbor RecvBuf neib#0 neib# neib# neib# BUFlength_i BUFlength_i BUFlength_i BUFlength_i import_index[0] import_index[] import_index[] import_index[] import_index[]

127 Example: sq-sr.c (6/6) Reading info of ext pts from receiving buffers C

MPI_Waitall(NeibPeTot, RequestRecv, StatRecv);
for(neib=0;neib<NeibPeTot;neib++){
  istart = ImportIndex[neib];
  iend   = ImportIndex[neib+1];
  for(i=istart;i<iend;i++){
    val[ImportItem[i]] = RecvBuf[i];
  }
}
MPI_Waitall(NeibPeTot, RequestSend, StatSend);

/* Contents of RecvBuf are copied to the values at external points. */

/*!C
!C-- OUTPUT
!C===*/
for(neib=0;neib<NeibPeTot;neib++){
  istart = ImportIndex[neib];
  iend   = ImportIndex[neib+1];
  for(i=istart;i<iend;i++){
    int in = ImportItem[i];
    printf("RECVbuf %d %d %d\n", MyRank, NeibPe[neib], val[in]);
  }
}

MPI_Finalize();
return 0;
}

128 Example: sq-sr.c (6/6) Writing values at external points C MPI_Waitall(NeibPeTot, RequestRecv, StatRecv); for(neib=0;neib<neibpetot;neib++){ istart = ImportIndex[neib]; iend = ImportIndex[neib+]; for(i=istart;i<iend;i++){ val[importitem[i]] = RecvBuf[i]; } } MPI_Waitall(NeibPeTot, RequestSend, StatSend); /*!C !C OUTPUT!C !C===*/ for(neib=0;neib<neibpetot;neib++){ istart = ImportIndex[neib]; iend = ImportIndex[neib+]; for(i=istart;i<iend;i++){ int in = ImportItem[i]; printf("recvbuf%d%d%d n", MyRank, NeibPe[neib], val[in]); } } MPI_Finalize(); } return 0;

129 Results (PE#0) (program output: lines of the form "RECVbuf <MyRank> <neighbor> <value>" for each external point)

130 Results (PE#1) (same output, PE#1 highlighted)

131 Results (PE#2) (same output, PE#2 highlighted)

132 Results (PE#3) (same output, PE#3 highlighted)

133 Distributed Local Data Structure for Parallel Computation A distributed local data structure for domain-to-domain communication has been introduced, which is appropriate for applications with sparse coefficient matrices (e.g. FDM, FEM, FVM). SPMD Local numbering: internal pts first, then external pts Generalized communication table Everything is easy if the proper data structure is defined: values at boundary pts are copied into sending buffers, Send/Recv, values at external pts are updated through receiving buffers.

134 This Idea can be applied to FEM with more complicated geometries. Red Lacquered Gate in 6 Pes, 0,6 elements k-metis Load Balance=.0 edgecut = 7,56 p-metis Load Balance=.00 edgecut = 7,7

135 Initial Mesh Exercise

136 #PE Three Domains 5 Exercise #PE #PE

137 Three Domains #PE #PE0 #PE Exercise 7

138 PE#0: sqm.0: fill s #PE #PE0 #PE #PE #PE0 #PE #NEIBPEtot #NEIBPE #NODE (int+ext, int pts) #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems Exercise

139 PE#: sqm.: fill s #PE #PE0 #PE #PE #PE0 #PE #NEIBPEtot #NEIBPE 0 #NODE (int+ext, int pts) #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems Exercise 9

140 PE#: sqm.: fill s #PE #PE0 #PE #PE #PE0 #PE #NEIBPEtot #NEIBPE 0 #NODE 9 5 (int+ext, int pts) #IMPORTindex #IMPORTitems #EXPORTindex #EXPORTitems Exercise 0

141 #PE #PE0 #PE Exercise

142 Procedures Exercise Number of Internal/External Points Where do External Pts come from? IMPORTindex,IMPORTitems Sequence of NEIBPE Then check destinations of Boundary Pts. EXPORTindex,EXPORTitems Sequence of NEIBPE sq.* are in <$O-S>/ex Create sqm.* by yourself copy <$O-S>/a.out (by sq-sr.c) to <$O-S>/ex pjsub go.sh

What is Point-to-Point Comm.?

What is Point-to-Point Comm.? 0 What is Point-to-Point Comm.? Collective Communication MPI_Reduce, MPI_Scatter/Gather etc. Communications with all processes in the communicator Application Area BEM, Spectral Method, MD: global interactions

More information

Report S1 C. Kengo Nakajima Information Technology Center. Technical & Scientific Computing II ( ) Seminar on Computer Science II ( )

Report S1 C. Kengo Nakajima Information Technology Center. Technical & Scientific Computing II ( ) Seminar on Computer Science II ( ) Report S1 C Kengo Nakajima Information Technology Center Technical & Scientific Computing II (4820-1028) Seminar on Computer Science II (4810-1205) Problem S1-3 Report S1 (2/2) Develop parallel program

More information

Report S2 C. Kengo Nakajima Information Technology Center. Technical & Scientific Computing II ( ) Seminar on Computer Science II ( )

Report S2 C. Kengo Nakajima Information Technology Center. Technical & Scientific Computing II ( ) Seminar on Computer Science II ( ) Report S2 C Kengo Nakajima Information Technology Center Technical & Scientific Computing II (4820-1028) Seminar on Computer Science II (4810-1205) S2-ref 2 Overview Distributed Local Data Program Results

More information

Introduction to Parallel Programming for Multicore/Manycore Clusters Part II-3: Parallel FVM using MPI

Introduction to Parallel Programming for Multicore/Manycore Clusters Part II-3: Parallel FVM using MPI Introduction to Parallel Programming for Multi/Many Clusters Part II-3: Parallel FVM using MPI Kengo Nakajima Information Technology Center The University of Tokyo 2 Overview Introduction Local Data Structure

More information

Ex.2: Send-Recv an Array (1/4)

Ex.2: Send-Recv an Array (1/4) MPI-t1 1 Ex.2: Send-Recv an Array (1/4) Exchange VE (real, 8-byte) between PE#0 & PE#1 PE#0 to PE#1 PE#0: send VE[0]-VE[10] (length=11) PE#1: recv. as VE[25]-VE[35] (length=11) PE#1 to PE#0 PE#1: send

More information

Report S1 C. Kengo Nakajima. Programming for Parallel Computing ( ) Seminar on Advanced Computing ( )

Report S1 C. Kengo Nakajima. Programming for Parallel Computing ( ) Seminar on Advanced Computing ( ) Report S1 C Kengo Nakajima Programming for Parallel Computing (616-2057) Seminar on Advanced Computing (616-4009) Problem S1-1 Report S1 (1/2) Read local files /a1.0~a1.3, /a2.0~a2.3. Develop

More information

Report S1 C. Kengo Nakajima

Report S1 C. Kengo Nakajima Report S1 C Kengo Nakajima Technical & Scientific Computing II (4820-1028) Seminar on Computer Science II (4810-1205) Hybrid Distributed Parallel Computing (3747-111) Problem S1-1 Report S1 Read local

More information

Report S2 C. Kengo Nakajima. Programming for Parallel Computing ( ) Seminar on Advanced Computing ( )

Report S2 C. Kengo Nakajima. Programming for Parallel Computing ( ) Seminar on Advanced Computing ( ) Report S2 C Kengo Nakajima Programming for Parallel Computing (616-2057) Seminar on Advanced Computing (616-4009) S2-ref 2 Overview Distributed Local Data Program Results FEM1D 3 1D Steady State Heat Conduction

More information

Report S1 Fortran. Kengo Nakajima Information Technology Center

Report S1 Fortran. Kengo Nakajima Information Technology Center Report S1 Fortran Kengo Nakajima Information Technology Center Technical & Scientific Computing II (4820-1028) Seminar on Computer Science II (4810-1205) Problem S1-1 Report S1 Read local files /a1.0~a1.3,

More information

Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in C

Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in C Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in C Kengo Nakajima Technical & Scientific Computing II (4820-1028) Seminar on Computer Science II (4810-1205) Hybrid Distributed Parallel

More information

MPI. (message passing, MIMD)

MPI. (message passing, MIMD) MPI (message passing, MIMD) What is MPI? a message-passing library specification extension of C/C++ (and Fortran) message passing for distributed memory parallel programming Features of MPI Point-to-point

More information

Introduction to MPI. Ricardo Fonseca. https://sites.google.com/view/rafonseca2017/

Introduction to MPI. Ricardo Fonseca. https://sites.google.com/view/rafonseca2017/ Introduction to MPI Ricardo Fonseca https://sites.google.com/view/rafonseca2017/ Outline Distributed Memory Programming (MPI) Message Passing Model Initializing and terminating programs Point to point

More information

Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in C

Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in C Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in C Kengo Nakajima Programming for Parallel Computing (616-2057) Seminar on Advanced Computing (616-4009) 1 Motivation for Parallel Computing

More information

CS 470 Spring Mike Lam, Professor. Distributed Programming & MPI

CS 470 Spring Mike Lam, Professor. Distributed Programming & MPI CS 470 Spring 2017 Mike Lam, Professor Distributed Programming & MPI MPI paradigm Single program, multiple data (SPMD) One program, multiple processes (ranks) Processes communicate via messages An MPI

More information

Message Passing Interface

Message Passing Interface MPSoC Architectures MPI Alberto Bosio, Associate Professor UM Microelectronic Departement bosio@lirmm.fr Message Passing Interface API for distributed-memory programming parallel code that runs across

More information

Introduction to Parallel Programming for Multicore/Manycore Clusters Part II-5: Communication-Computation Overlapping and Some Other Issues

Introduction to Parallel Programming for Multicore/Manycore Clusters Part II-5: Communication-Computation Overlapping and Some Other Issues Introduction to Parallel Programming for Multicore/Manycore Clusters Part II-5: Communication-Computation Overlapping and Some Other Issues Kengo Nakajima Information Technology Center The University of

More information

CS 470 Spring Mike Lam, Professor. Distributed Programming & MPI

CS 470 Spring Mike Lam, Professor. Distributed Programming & MPI CS 470 Spring 2018 Mike Lam, Professor Distributed Programming & MPI MPI paradigm Single program, multiple data (SPMD) One program, multiple processes (ranks) Processes communicate via messages An MPI

More information

CS 179: GPU Programming. Lecture 14: Inter-process Communication

CS 179: GPU Programming. Lecture 14: Inter-process Communication CS 179: GPU Programming Lecture 14: Inter-process Communication The Problem What if we want to use GPUs across a distributed system? GPU cluster, CSIRO Distributed System A collection of computers Each

More information

Recap of Parallelism & MPI

Recap of Parallelism & MPI Recap of Parallelism & MPI Chris Brady Heather Ratcliffe The Angry Penguin, used under creative commons licence from Swantje Hess and Jannis Pohlmann. Warwick RSE 13/12/2017 Parallel programming Break

More information

The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing

The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Parallelism Decompose the execution into several tasks according to the work to be done: Function/Task

More information

Introduction to the Message Passing Interface (MPI)

Introduction to the Message Passing Interface (MPI) Introduction to the Message Passing Interface (MPI) CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction to the Message Passing Interface (MPI) Spring 2018

More information

Message Passing Interface

Message Passing Interface Message Passing Interface DPHPC15 TA: Salvatore Di Girolamo DSM (Distributed Shared Memory) Message Passing MPI (Message Passing Interface) A message passing specification implemented

More information

Communication-Computation. 3D Parallel FEM (V) Overlapping. Kengo Nakajima

Communication-Computation. 3D Parallel FEM (V) Overlapping. Kengo Nakajima 3D Parallel FEM (V) Communication-Computation Overlapping Kengo Nakajima Programming for Parallel Computing (616-2057) Seminar on Advanced Computing (616-4009) U y =0 @ y=y min (N z -1) elements N z nodes

More information

Message Passing Interface. most of the slides taken from Hanjun Kim

Message Passing Interface. most of the slides taken from Hanjun Kim Message Passing Interface most of the slides taken from Hanjun Kim Message Passing Pros Scalable, Flexible Cons Someone says it s more difficult than DSM MPI (Message Passing Interface) A standard message

More information

Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in C

Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in C Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in C Kengo Nakajima Programming for Parallel Computing (616-2057) Seminar on Advanced Computing (616-4009) 1 Motivation for Parallel Computing

More information

First day. Basics of parallel programming. RIKEN CCS HPC Summer School Hiroya Matsuba, RIKEN CCS

First day. Basics of parallel programming. RIKEN CCS HPC Summer School Hiroya Matsuba, RIKEN CCS First day Basics of parallel programming RIKEN CCS HPC Summer School Hiroya Matsuba, RIKEN CCS Today s schedule: Basics of parallel programming 7/22 AM: Lecture Goals Understand the design of typical parallel

More information

HPC Parallel Programing Multi-node Computation with MPI - I

HPC Parallel Programing Multi-node Computation with MPI - I HPC Parallel Programing Multi-node Computation with MPI - I Parallelization and Optimization Group TATA Consultancy Services, Sahyadri Park Pune, India TCS all rights reserved April 29, 2013 Copyright

More information

3D Parallel FEM (V) Communication-Computation Overlapping

3D Parallel FEM (V) Communication-Computation Overlapping 3D Parallel FEM (V) Communication-Computation Overlapping Kengo Nakajima Programming for Parallel Computing (616-2057) Seminar on Advanced Computing (616-4009) 2 z Uniform Distributed Force in z-direction

More information

Department of Informatics V. HPC-Lab. Session 4: MPI, CG M. Bader, A. Breuer. Alex Breuer

Department of Informatics V. HPC-Lab. Session 4: MPI, CG M. Bader, A. Breuer. Alex Breuer HPC-Lab Session 4: MPI, CG M. Bader, A. Breuer Meetings Date Schedule 10/13/14 Kickoff 10/20/14 Q&A 10/27/14 Presentation 1 11/03/14 H. Bast, Intel 11/10/14 Presentation 2 12/01/14 Presentation 3 12/08/14

More information

Outline. Communication modes MPI Message Passing Interface Standard

Outline. Communication modes MPI Message Passing Interface Standard MPI THOAI NAM Outline Communication modes MPI Message Passing Interface Standard TERMs (1) Blocking If return from the procedure indicates the user is allowed to reuse resources specified in the call Non-blocking

More information

MPI Message Passing Interface

MPI Message Passing Interface MPI Message Passing Interface Portable Parallel Programs Parallel Computing A problem is broken down into tasks, performed by separate workers or processes Processes interact by exchanging information

More information

IPM Workshop on High Performance Computing (HPC08) IPM School of Physics Workshop on High Perfomance Computing/HPC08

IPM Workshop on High Performance Computing (HPC08) IPM School of Physics Workshop on High Perfomance Computing/HPC08 IPM School of Physics Workshop on High Perfomance Computing/HPC08 16-21 February 2008 MPI tutorial Luca Heltai Stefano Cozzini Democritos/INFM + SISSA 1 When

More information

CSE 613: Parallel Programming. Lecture 21 ( The Message Passing Interface )

CSE 613: Parallel Programming. Lecture 21 ( The Message Passing Interface ) CSE 613: Parallel Programming Lecture 21 ( The Message Passing Interface ) Jesmin Jahan Tithi Department of Computer Science SUNY Stony Brook Fall 2013 ( Slides from Rezaul A. Chowdhury ) Principles of

More information

Introduction to MPI. May 20, Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign

Introduction to MPI. May 20, Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign Introduction to MPI May 20, 2013 Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign Top500.org PERFORMANCE DEVELOPMENT 1 Eflop/s 162 Pflop/s PROJECTED 100 Pflop/s

More information

Experiencing Cluster Computing Message Passing Interface

Experiencing Cluster Computing Message Passing Interface Experiencing Cluster Computing Message Passing Interface Class 6 Message Passing Paradigm The Underlying Principle A parallel program consists of p processes with different address spaces. Communication

More information

MPI, Part 3. Scientific Computing Course, Part 3

MPI, Part 3. Scientific Computing Course, Part 3 MPI, Part 3 Scientific Computing Course, Part 3 Non-blocking communications Diffusion: Had to Global Domain wait for communications to compute Could not compute end points without guardcell data All work

More information

Outline. Communication modes MPI Message Passing Interface Standard. Khoa Coâng Ngheä Thoâng Tin Ñaïi Hoïc Baùch Khoa Tp.HCM

Outline. Communication modes MPI Message Passing Interface Standard. Khoa Coâng Ngheä Thoâng Tin Ñaïi Hoïc Baùch Khoa Tp.HCM THOAI NAM Outline Communication modes MPI Message Passing Interface Standard TERMs (1) Blocking If return from the procedure indicates the user is allowed to reuse resources specified in the call Non-blocking

More information

Distributed Memory Parallel Programming

Distributed Memory Parallel Programming COSC Big Data Analytics Parallel Programming using MPI Edgar Gabriel Spring 201 Distributed Memory Parallel Programming Vast majority of clusters are homogeneous Necessitated by the complexity of maintaining

More information

Collective Communications I

Collective Communications I Collective Communications I Ned Nedialkov McMaster University Canada CS/SE 4F03 January 2016 Outline Introduction Broadcast Reduce c 2013 16 Ned Nedialkov 2/14 Introduction A collective communication involves

More information

MPI, Part 2. Scientific Computing Course, Part 3

MPI, Part 2. Scientific Computing Course, Part 3 MPI, Part 2 Scientific Computing Course, Part 3 Something new: Sendrecv A blocking send and receive built in together Lets them happen simultaneously Can automatically pair the sends/recvs! dest, source

More information

MPI: Parallel Programming for Extreme Machines. Si Hammond, High Performance Systems Group

MPI: Parallel Programming for Extreme Machines. Si Hammond, High Performance Systems Group MPI: Parallel Programming for Extreme Machines Si Hammond, High Performance Systems Group Quick Introduction Si Hammond, (sdh@dcs.warwick.ac.uk) WPRF/PhD Research student, High Performance Systems Group,

More information

Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in Fortran

Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in Fortran Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in Fortran Kengo Nakajima Programming for Parallel Computing (616-2057) Seminar on Advanced Computing (616-4009) 1 Motivation for Parallel

More information

Programming Scalable Systems with MPI. Clemens Grelck, University of Amsterdam

Programming Scalable Systems with MPI. Clemens Grelck, University of Amsterdam Clemens Grelck University of Amsterdam UvA / SurfSARA High Performance Computing and Big Data Course June 2014 Parallel Programming with Compiler Directives: OpenMP Message Passing Gentle Introduction

More information

Cluster Computing MPI. Industrial Standard Message Passing

Cluster Computing MPI. Industrial Standard Message Passing MPI Industrial Standard Message Passing MPI Features Industrial Standard Highly portable Widely available SPMD programming model Synchronous execution MPI Outer scope int MPI_Init( int *argc, char ** argv)

More information

Distributed Systems + Middleware Advanced Message Passing with MPI

Distributed Systems + Middleware Advanced Message Passing with MPI Distributed Systems + Middleware Advanced Message Passing with MPI Gianpaolo Cugola Dipartimento di Elettronica e Informazione Politecnico, Italy cugola@elet.polimi.it http://home.dei.polimi.it/cugola

More information

Introduction to MPI: Part II

Introduction to MPI: Part II Introduction to MPI: Part II Pawel Pomorski, University of Waterloo, SHARCNET ppomorsk@sharcnetca November 25, 2015 Summary of Part I: To write working MPI (Message Passing Interface) parallel programs

More information

Scientific Computing

Scientific Computing Lecture on Scientific Computing Dr. Kersten Schmidt Lecture 21 Technische Universität Berlin Institut für Mathematik Wintersemester 2014/2015 Syllabus Linear Regression, Fast Fourier transform Modelling

More information

MPI point-to-point communication

MPI point-to-point communication MPI point-to-point communication Slides Sebastian von Alfthan CSC Tieteen tietotekniikan keskus Oy CSC IT Center for Science Ltd. Introduction MPI processes are independent, they communicate to coordinate

More information

東京大学情報基盤中心准教授片桐孝洋 Takahiro Katagiri, Associate Professor, Information Technology Center, The University of Tokyo

東京大学情報基盤中心准教授片桐孝洋 Takahiro Katagiri, Associate Professor, Information Technology Center, The University of Tokyo Overview of MPI 東京大学情報基盤中心准教授片桐孝洋 Takahiro Katagiri, Associate Professor, Information Technology Center, The University of Tokyo 台大数学科学中心科学計算冬季学校 1 Agenda 1. Features of MPI 2. Basic MPI Functions 3. Reduction

More information

Basic MPI Communications. Basic MPI Communications (cont d)

Basic MPI Communications. Basic MPI Communications (cont d) Basic MPI Communications MPI provides two non-blocking routines: MPI_Isend(buf,cnt,type,dst,tag,comm,reqHandle) buf: source of data to be sent cnt: number of data elements to be sent type: type of each

More information

Collective Communication in MPI and Advanced Features

Collective Communication in MPI and Advanced Features Collective Communication in MPI and Advanced Features Pacheco s book. Chapter 3 T. Yang, CS240A. Part of slides from the text book, CS267 K. Yelick from UC Berkeley and B. Gropp, ANL Outline Collective

More information

Practical Introduction to Message-Passing Interface (MPI)

Practical Introduction to Message-Passing Interface (MPI) 1 Practical Introduction to Message-Passing Interface (MPI) October 1st, 2015 By: Pier-Luc St-Onge Partners and Sponsors 2 Setup for the workshop 1. Get a user ID and password paper (provided in class):

More information

lslogin3$ cd lslogin3$ tar -xvf ~train00/mpibasic_lab.tar cd mpibasic_lab/pi cd mpibasic_lab/decomp1d

lslogin3$ cd lslogin3$ tar -xvf ~train00/mpibasic_lab.tar cd mpibasic_lab/pi cd mpibasic_lab/decomp1d MPI Lab Getting Started Login to ranger.tacc.utexas.edu Untar the lab source code lslogin3$ cd lslogin3$ tar -xvf ~train00/mpibasic_lab.tar Part 1: Getting Started with simple parallel coding hello mpi-world

More information

Slides prepared by : Farzana Rahman 1

Slides prepared by : Farzana Rahman 1 Introduction to MPI 1 Background on MPI MPI - Message Passing Interface Library standard defined by a committee of vendors, implementers, and parallel programmers Used to create parallel programs based

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 4 Message-Passing Programming Learning Objectives n Understanding how MPI programs execute n Familiarity with fundamental MPI functions

More information

AMath 483/583 Lecture 21

AMath 483/583 Lecture 21 AMath 483/583 Lecture 21 Outline: Review MPI, reduce and bcast MPI send and receive Master Worker paradigm References: $UWHPSC/codes/mpi class notes: MPI section class notes: MPI section of bibliography

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalabissu-tokyoacjp/~reiji/pna16/ [ 5 ] MPI: Message Passing Interface Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1 Architecture

More information

Parallel hardware. Distributed Memory. Parallel software. COMP528 MPI Programming, I. Flynn s taxonomy:

Parallel hardware. Distributed Memory. Parallel software. COMP528 MPI Programming, I. Flynn s taxonomy: COMP528 MPI Programming, I www.csc.liv.ac.uk/~alexei/comp528 Alexei Lisitsa Dept of computer science University of Liverpool a.lisitsa@.liverpool.ac.uk Flynn s taxonomy: Parallel hardware SISD (Single

More information

MPI Lab. How to split a problem across multiple processors Broadcasting input to other nodes Using MPI_Reduce to accumulate partial sums

MPI Lab. How to split a problem across multiple processors Broadcasting input to other nodes Using MPI_Reduce to accumulate partial sums MPI Lab Parallelization (Calculating π in parallel) How to split a problem across multiple processors Broadcasting input to other nodes Using MPI_Reduce to accumulate partial sums Sharing Data Across Processors

More information

MPI MESSAGE PASSING INTERFACE

MPI MESSAGE PASSING INTERFACE MPI MESSAGE PASSING INTERFACE David COLIGNON CÉCI - Consortium des Équipements de Calcul Intensif http://hpc.montefiore.ulg.ac.be Outline Introduction From serial source code to parallel execution MPI

More information

MPI Tutorial. Shao-Ching Huang. IDRE High Performance Computing Workshop

MPI Tutorial. Shao-Ching Huang. IDRE High Performance Computing Workshop MPI Tutorial Shao-Ching Huang IDRE High Performance Computing Workshop 2013-02-13 Distributed Memory Each CPU has its own (local) memory This needs to be fast for parallel scalability (e.g. Infiniband,

More information

Message Passing Interface

Message Passing Interface Message Passing Interface by Kuan Lu 03.07.2012 Scientific researcher at Georg-August-Universität Göttingen and Gesellschaft für wissenschaftliche Datenverarbeitung mbh Göttingen Am Faßberg, 37077 Göttingen,

More information

CS 426. Building and Running a Parallel Application

CS 426. Building and Running a Parallel Application CS 426 Building and Running a Parallel Application 1 Task/Channel Model Design Efficient Parallel Programs (or Algorithms) Mainly for distributed memory systems (e.g. Clusters) Break Parallel Computations

More information

Introduction to parallel computing concepts and technics

Introduction to parallel computing concepts and technics Introduction to parallel computing concepts and technics Paschalis Korosoglou (support@grid.auth.gr) User and Application Support Unit Scientific Computing Center @ AUTH Overview of Parallel computing

More information

CS 470 Spring Mike Lam, Professor. Distributed Programming & MPI

CS 470 Spring Mike Lam, Professor. Distributed Programming & MPI CS 470 Spring 2019 Mike Lam, Professor Distributed Programming & MPI MPI paradigm Single program, multiple data (SPMD) One program, multiple processes (ranks) Processes communicate via messages An MPI

More information

Parallel Programming, MPI Lecture 2

Parallel Programming, MPI Lecture 2 Parallel Programming, MPI Lecture 2 Ehsan Nedaaee Oskoee 1 1 Department of Physics IASBS IPM Grid and HPC workshop IV, 2011 Outline 1 Introduction and Review The Von Neumann Computer Kinds of Parallel

More information

More about MPI programming. More about MPI programming p. 1

More about MPI programming. More about MPI programming p. 1 More about MPI programming More about MPI programming p. 1 Some recaps (1) One way of categorizing parallel computers is by looking at the memory configuration: In shared-memory systems, the CPUs share

More information

Practical Course Scientific Computing and Visualization

Practical Course Scientific Computing and Visualization July 5, 2006 Page 1 of 21 1. Parallelization Architecture our target architecture: MIMD distributed address space machines program1 data1 program2 data2 program program3 data data3.. program(data) program1(data1)

More information

Introduction to MPI. Jerome Vienne Texas Advanced Computing Center January 10 th,

Introduction to MPI. Jerome Vienne Texas Advanced Computing Center January 10 th, Introduction to MPI Jerome Vienne Texas Advanced Computing Center January 10 th, 2013 Email: viennej@tacc.utexas.edu 1 Course Objectives & Assumptions Objectives Teach basics of MPI-Programming Share information

More information

MA471. Lecture 5. Collective MPI Communication

MA471. Lecture 5. Collective MPI Communication MA471 Lecture 5 Collective MPI Communication Today: When all the processes want to send, receive or both Excellent website for MPI command syntax available at: http://www-unix.mcs.anl.gov/mpi/www/ 9/10/2003

More information

Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in Fortran

Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in Fortran Introduction to Programming by MPI for Parallel FEM Report S1 & S2 in Fortran Kengo Nakajima Programming for Parallel Computing (616-2057) Seminar on Advanced Computing (616-4009) 1 Motivation for Parallel

More information

Parallel Programming. Using MPI (Message Passing Interface)

Parallel Programming. Using MPI (Message Passing Interface) Parallel Programming Using MPI (Message Passing Interface) Message Passing Model Simple implementation of the task/channel model Task Process Channel Message Suitable for a multicomputer Number of processes

More information

Overview of MPI 國家理論中心數學組 高效能計算 短期課程. 名古屋大学情報基盤中心教授片桐孝洋 Takahiro Katagiri, Professor, Information Technology Center, Nagoya University

Overview of MPI 國家理論中心數學組 高效能計算 短期課程. 名古屋大学情報基盤中心教授片桐孝洋 Takahiro Katagiri, Professor, Information Technology Center, Nagoya University Overview of MPI 名古屋大学情報基盤中心教授片桐孝洋 Takahiro Katagiri, Professor, Information Technology Center, Nagoya University 國家理論中心數學組 高效能計算 短期課程 1 Agenda 1. Features of MPI 2. Basic MPI Functions 3. Reduction Operations

More information

Parallel Programming using MPI. Supercomputing group CINECA

Parallel Programming using MPI. Supercomputing group CINECA Parallel Programming using MPI Supercomputing group CINECA Contents Programming with message passing Introduction to message passing and MPI Basic MPI programs MPI Communicators Send and Receive function

More information

Parallel Programming Using MPI

Parallel Programming Using MPI Parallel Programming Using MPI Short Course on HPC 15th February 2019 Aditya Krishna Swamy adityaks@iisc.ac.in SERC, Indian Institute of Science When Parallel Computing Helps? Want to speed up your calculation

More information

AMath 483/583 Lecture 18 May 6, 2011

AMath 483/583 Lecture 18 May 6, 2011 AMath 483/583 Lecture 18 May 6, 2011 Today: MPI concepts Communicators, broadcast, reduce Next week: MPI send and receive Iterative methods Read: Class notes and references $CLASSHG/codes/mpi MPI Message

More information

Parallel Computing. Distributed memory model MPI. Leopold Grinberg T. J. Watson IBM Research Center, USA. Instructor: Leopold Grinberg

Parallel Computing. Distributed memory model MPI. Leopold Grinberg T. J. Watson IBM Research Center, USA. Instructor: Leopold Grinberg Parallel Computing Distributed memory model MPI Leopold Grinberg T. J. Watson IBM Research Center, USA Why do we need to compute in parallel large problem size - memory constraints computation on a single

More information

MPI: The Message-Passing Interface. Most of this discussion is from [1] and [2].

MPI: The Message-Passing Interface. Most of this discussion is from [1] and [2]. MPI: The Message-Passing Interface Most of this discussion is from [1] and [2]. What Is MPI? The Message-Passing Interface (MPI) is a standard for expressing distributed parallelism via message passing.

More information

Programming Scalable Systems with MPI. UvA / SURFsara High Performance Computing and Big Data. Clemens Grelck, University of Amsterdam

Programming Scalable Systems with MPI. UvA / SURFsara High Performance Computing and Big Data. Clemens Grelck, University of Amsterdam Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data Message Passing as a Programming Paradigm Gentle Introduction to MPI Point-to-point Communication Message Passing

More information

Introduction to MPI. SuperComputing Applications and Innovation Department 1 / 143

Introduction to MPI. SuperComputing Applications and Innovation Department 1 / 143 Introduction to MPI Isabella Baccarelli - i.baccarelli@cineca.it Mariella Ippolito - m.ippolito@cineca.it Cristiano Padrin - c.padrin@cineca.it Vittorio Ruggiero - v.ruggiero@cineca.it SuperComputing Applications

More information

Programming with MPI Collectives

Programming with MPI Collectives Programming with MPI Collectives Jan Thorbecke Type to enter text Delft University of Technology Challenge the future Collectives Classes Communication types exercise: BroadcastBarrier Gather Scatter exercise:

More information

Topics. Lecture 7. Review. Other MPI collective functions. Collective Communication (cont d) MPI Programming (III)

Topics. Lecture 7. Review. Other MPI collective functions. Collective Communication (cont d) MPI Programming (III) Topics Lecture 7 MPI Programming (III) Collective communication (cont d) Point-to-point communication Basic point-to-point communication Non-blocking point-to-point communication Four modes of blocking

More information

CEE 618 Scientific Parallel Computing (Lecture 5): Message-Passing Interface (MPI) advanced

CEE 618 Scientific Parallel Computing (Lecture 5): Message-Passing Interface (MPI) advanced 1 / 32 CEE 618 Scientific Parallel Computing (Lecture 5): Message-Passing Interface (MPI) advanced Albert S. Kim Department of Civil and Environmental Engineering University of Hawai i at Manoa 2540 Dole

More information

Parallel programming with MPI Part I -Introduction and Point-to-Point Communications

Parallel programming with MPI Part I -Introduction and Point-to-Point Communications Parallel programming with MPI Part I -Introduction and Point-to-Point Communications A. Emerson, A. Marani, Supercomputing Applications and Innovation (SCAI), CINECA 23 February 2016 MPI course 2016 Contents

More information

2018/7/ :25-12:10

2018/7/ :25-12:10 2018/7/10 1 LU + 2018 7 10 10:25-12:10 2018/7/10 1. 4 10 ( ) 2. 4 17 l 3. 4 24 l 4. 5 1 l 5. 5 8 l ( ) 6. 5 15 l 2-2018 8 6 24 7. 5 22 l 8. 6 5 l - 9. 6 19 l 10. 7 3 1 ( 2) l 11. 7 3 2 l l 12. 7 10 l 13.

More information

A message contains a number of elements of some particular datatype. MPI datatypes:

A message contains a number of elements of some particular datatype. MPI datatypes: Messages Messages A message contains a number of elements of some particular datatype. MPI datatypes: Basic types. Derived types. Derived types can be built up from basic types. C types are different from

More information

Practical Scientific Computing: Performanceoptimized

Practical Scientific Computing: Performanceoptimized Practical Scientific Computing: Performanceoptimized Programming Programming with MPI November 29, 2006 Dr. Ralf-Peter Mundani Department of Computer Science Chair V Technische Universität München, Germany

More information

MPI MESSAGE PASSING INTERFACE

MPI MESSAGE PASSING INTERFACE MPI MESSAGE PASSING INTERFACE David COLIGNON, ULiège CÉCI - Consortium des Équipements de Calcul Intensif http://www.ceci-hpc.be Outline Introduction From serial source code to parallel execution MPI functions

More information

Introduction to MPI. SHARCNET MPI Lecture Series: Part I of II. Paul Preney, OCT, M.Sc., B.Ed., B.Sc.

Introduction to MPI. SHARCNET MPI Lecture Series: Part I of II. Paul Preney, OCT, M.Sc., B.Ed., B.Sc. Introduction to MPI SHARCNET MPI Lecture Series: Part I of II Paul Preney, OCT, M.Sc., B.Ed., B.Sc. preney@sharcnet.ca School of Computer Science University of Windsor Windsor, Ontario, Canada Copyright

More information

Parallel programming with MPI Part I -Introduction and Point-to-Point

Parallel programming with MPI Part I -Introduction and Point-to-Point Parallel programming with MPI Part I -Introduction and Point-to-Point Communications A. Emerson, Supercomputing Applications and Innovation (SCAI), CINECA 1 Contents Introduction to message passing and

More information

High Performance Computing

High Performance Computing High Performance Computing Course Notes 2009-2010 2010 Message Passing Programming II 1 Communications Point-to-point communications: involving exact two processes, one sender and one receiver For example,

More information

Introduction to MPI. Ritu Arora Texas Advanced Computing Center June 17,

Introduction to MPI. Ritu Arora Texas Advanced Computing Center June 17, Introduction to MPI Ritu Arora Texas Advanced Computing Center June 17, 2014 Email: rauta@tacc.utexas.edu 1 Course Objectives & Assumptions Objectives Teach basics of MPI-Programming Share information

More information

Distributed Memory Programming with MPI

Distributed Memory Programming with MPI Distributed Memory Programming with MPI Part 1 Bryan Mills, PhD Spring 2017 A distributed memory system A shared memory system Identifying MPI processes n Common pracace to idenafy processes by nonnegaave

More information

The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs

The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs 1 The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs http://mpi-forum.org https://www.open-mpi.org/ Mike Bailey mjb@cs.oregonstate.edu Oregon State University mpi.pptx

More information

Outline. Introduction to HPC computing. OpenMP MPI. Introduction. Understanding communications. Collective communications. Communicators.

Outline. Introduction to HPC computing. OpenMP MPI. Introduction. Understanding communications. Collective communications. Communicators. Lecture 8 MPI Outline Introduction to HPC computing OpenMP MPI Introduction Understanding communications Collective communications Communicators Topologies Grouping Data for Communication Input / output

More information

Reusing this material

Reusing this material Messages Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

CSE 160 Lecture 15. Message Passing

CSE 160 Lecture 15. Message Passing CSE 160 Lecture 15 Message Passing Announcements 2013 Scott B. Baden / CSE 160 / Fall 2013 2 Message passing Today s lecture The Message Passing Interface - MPI A first MPI Application The Trapezoidal

More information

What s in this talk? Quick Introduction. Programming in Parallel

What s in this talk? Quick Introduction. Programming in Parallel What s in this talk? Parallel programming methodologies - why MPI? Where can I use MPI? MPI in action Getting MPI to work at Warwick Examples MPI: Parallel Programming for Extreme Machines Si Hammond,

More information

Collective Communications

Collective Communications Collective Communications Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

MPI Program Structure

MPI Program Structure MPI Program Structure Handles MPI communicator MPI_COMM_WORLD Header files MPI function format Initializing MPI Communicator size Process rank Exiting MPI 1 Handles MPI controls its own internal data structures

More information