Lightweight MPI Communicators with Applications to


1 Lightweight MPI Communicators with Applications to
Michael Axtmann, Peter Sanders, Armin Wiebigke
IPDPS, May 22, 2018
KIT, The Research University in the Helmholtz Association

2 Overview
Communicators and communication
Disadvantages of communicator construction
Solutions for MPI
RBC communicators
Case study on sorting

3 Communicators in MPI
Blocking point-to-point: Send, Receive
Nonblocking point-to-point: ISend, compute and Test; IReceive and Test
Blocking collective: Scan
Nonblocking collective: IScan, compute and Test
[Figure: MPI_COMM_WORLD with a subcommunicator, and PEs 0 to 2 executing each of the four communication patterns]

4 MPI
Examples
  Communication over rows and columns
  Divide and conquer
Usage of communicators
  Divide tasks into fine-grained subproblems
  Elegant algorithms and comfortable programming
Communicators make life easier at no cost!?

5 Current Implementations (OpenMPI and MPICH)
PE group
  Mapping from PE ID to process ID required
  Explicit representation as table
  Communicator creation takes time linear in the communicator size
Context ID
  Separates communication between communicators
  Part of each message
  Unique for all PEs of the PE group
  Blocking Allgather operation on context ID mask
  Communicator creation is a blocking collective operation
[Figure: two subcommunicators with context IDs 0 and 1]

8 Blocking Communicator Creation
"... nonblocking collective operations can
  mitigate possible synchronizing effects
  enabling communication-computation overlap
  perform collective operations on overlapping communicators, which would lead to deadlocks with blocking operations." (MPI Standard)
A collective operation is invoked by all PEs of a communicator
BUT: communicator creation breaks the nonblocking idea
[Figure: timeline of a nonblocking collective (IScan, compute and Test) next to a nonblocking collective that is preceded by communicator creation]

9 Communicator Construction
Splitting a communicator into two communicators of half the size
[Plot: running time / communicator size [µs] over communicator size (cores) on SuperMUC; series: IBM MPI_Comm_create_group, IBM MPI_Comm_split, Intel MPI_Comm_create_group, Intel MPI_Comm_split]
Communicator construction time is linear in the PE group size

10 Communicator Construction
Splitting 2^15 PEs into two communicators of size 2^14
Collective operation on 2^14 cores
[Plot: running time [ms] over message length [B] on SuperMUC with IBM MPI; series: MPI_Reduce, MPI_Exscan, MPI_Comm_split]
Communicator construction is expensive compared to collectives

11 Communicator Construction
Splitting a communicator into overlapping communicators of size four
The PE group invokes MPI_Comm_create_group
[Figure: alternating and cascading creation order of the overlapping communicators over time]
[Plot: running time [ms] over communicator size (cores) on SuperMUC with Intel MPI; series: Alternating, Cascade]
Blocking communicator creation causes delays

12 Proposals for MPI
PE group
  Sparse representations, e.g. MPI_Group_range_incl
  Subcommunicator: first=1, last=4, stride=2
Context ID
  User-defined tag
  Calculated by MPI: concatenation of counters
[Figure: context IDs as concatenated counters, starting at MPI_COMM_WORLD: {0}, {1}, {2}, {3}, {0,0}, {0,1}, {2,0}, {2,1}, {2,2}]
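The difference between an explicit rank table and a sparse (first, last, stride) triple can be sketched in a few lines. This is only an illustration without MPI: `RankRange`, `range_size` and `expand` are hypothetical names, and the sketch assumes a positive stride and first <= last, which is just one case of what MPI_Group_range_incl accepts.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sparse PE group: ranks first, first + stride, ... up to last.
// Assumes stride > 0 and first <= last.
struct RankRange {
  int first;
  int last;
  int stride;
};

// Number of ranks described by the triple, computed in O(1).
int range_size(const RankRange& r) {
  return (r.last - r.first) / r.stride + 1;
}

// Explicit table of the same ranks; building it costs time and
// memory linear in the group size.
std::vector<int> expand(const RankRange& r) {
  std::vector<int> ranks;
  ranks.reserve(range_size(r));
  for (int k = 0; k < range_size(r); ++k) {
    ranks.push_back(r.first + k * r.stride);
  }
  return ranks;
}
```

For the triple on the slide (first=1, last=4, stride=2) the explicit table holds the ranks 1 and 3, while the sparse form stays at three integers regardless of the group size.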

13 Our RBC library
Range-based communicator in O(1) time
Local construction
Select MPI or RBC operations
Local splitting: Split_RBC_Comm(Comm&, Comm&, int first, int last, int stride)
Only adjust range

Blocking ops      Nonblocking ops    Classes        Local ops
rbc::bcast        rbc::ibcast        rbc::request   rbc::create_RBC_Comm
rbc::reduce       rbc::ireduce       rbc::comm      rbc::split_RBC_Comm
rbc::allreduce    rbc::iallreduce                   RBC::Comm_rank
rbc::scan         rbc::iscan                        rbc::comm_size
rbc::gather       rbc::igather
rbc::gatherv      rbc::igatherv
rbc::barrier      rbc::ibarrier
rbc::send         rbc::isend
rbc::recv         rbc::irecv
rbc::probe        rbc::iprobe
rbc::wait         rbc::test
rbc::waitall

rbc::comm
  + parent_comm : MPI_Comm
  + first : int
  + last : int
  + stride : int
[Figure: initial MPI communicator with the range first=1, last=7, stride=3]
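Why splitting can be local and take O(1) time is easiest to see on a stripped-down model of the communicator. The sketch below is not the RBC implementation: `RangeComm`, `range_comm_size` and `split_range` are hypothetical names, the parent MPI communicator is left out, and it assumes that the split arguments are given as ranks relative to the communicator being split.

```cpp
#include <cassert>

// Hypothetical model of a range-based communicator: only three integers,
// expressed as ranks of the underlying (omitted) MPI communicator.
struct RangeComm {
  int first;
  int last;
  int stride;
};

// Number of PEs in the range, computed in O(1).
int range_comm_size(const RangeComm& c) {
  return (c.last - c.first) / c.stride + 1;
}

// Split without communication: first, last and stride are assumed to be
// relative to the ranks of `parent`. Only the range is adjusted.
RangeComm split_range(const RangeComm& parent, int first, int last, int stride) {
  RangeComm child;
  child.first = parent.first + first * parent.stride;
  child.last = parent.first + last * parent.stride;
  child.stride = parent.stride * stride;
  return child;
}
```

With the range from the slide (first=1, last=7, stride=3, i.e. the ranks 1, 4, 7), `split_range(parent, 0, 1, 1)` describes the ranks 1 and 4 and `split_range(parent, 2, 2, 1)` describes rank 7. No PE has to exchange a message to agree on this, which is the point of the design.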

14 Our RBC library: Implementation Details
(Non)blocking point-to-point communication
  Maps rank to rank of MPI communicator
  Calls MPI counterpart
(Non)blocking collective operations
  Call point-to-point operations of RBC
  One globally reserved tag
Nonblocking details
  Optional user-defined tag
  Round-based schedule
  rbc::ibcast(void *buff, int cnt, MPI_Datatype datatype, int root, rbc::comm comm, rbc::request *request, int tag = RBC_IBCAST_TAG)
[Figure: broadcast schedule per PE over time]
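The two ingredients named above, rank translation and a round-based schedule built from point-to-point messages, might look as follows. The slides do not say which schedule RBC uses; the binomial tree rooted at rank 0 is an assumption chosen for illustration, and `to_parent_rank`, `Step` and `bcast_schedule` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Translate a rank inside a range (first, stride) to the rank of the
// underlying MPI communicator.
int to_parent_rank(int first, int stride, int rank) {
  return first + rank * stride;
}

// One point-to-point step of the schedule.
struct Step {
  int round;  // round in which the step happens
  int peer;   // rank (inside the range) of the communication partner
  bool send;  // true: send to peer, false: receive from peer
};

// Assumed schedule: binomial-tree broadcast from rank 0 among p ranks.
// In the round with distance d, every rank below d already holds the data
// and forwards it to rank + d.
std::vector<Step> bcast_schedule(int rank, int p) {
  std::vector<Step> steps;
  int round = 0;
  for (int dist = 1; dist < p; dist *= 2, ++round) {
    if (rank < dist) {
      if (rank + dist < p) steps.push_back({round, rank + dist, true});
    } else if (rank < 2 * dist) {
      steps.push_back({round, rank - dist, false});
    }
  }
  return steps;
}
```

A nonblocking collective can then advance through such a list one round at a time whenever the user calls test, instead of blocking until all rounds are done. Each peer would be passed through `to_parent_rank` before the MPI point-to-point call is issued.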

15 RBC vs. MPI
Splitting a communicator into two communicators of half the size
[Plot: running time / communicator size [µs] over communicator size (cores) on SuperMUC; series: IBM MPI_Comm_create_group, IBM MPI_Comm_split, Intel MPI_Comm_create_group, Intel MPI_Comm_split, RBC rbc::split_rbc_comm]
RBC splitting comes with almost no cost

16 RBC vs. MPI
Splitting a communicator with 2^15 PEs into two communicators of size 2^14 and performing a broadcast operation on both communicators
[Plot: running time [ms] over message length [B] on SuperMUC with IBM MPI, divided into splitting and collective operation; series: IBM MPI_Ibcast, IBM rbc::ibcast]
RBC splitting comes with almost no cost

17 Cost of Communicators
Splitting a communicator into overlapping communicators of size four
The PE group invokes MPI_Comm_create_group
[Figure: alternating and cascading creation order of the overlapping communicators over time]
[Plot: running time [ms] over communicator size (cores) on SuperMUC with Intel MPI; series: MPI Alternating, MPI Cascade, RBC Alternating, RBC Cascade]
Cascades do not have an effect on RBC

18 Quicksort on Distributed Systems
Single-ported message passing
Sending of l machine words: α + βl
Analyze critical path
Small inputs
Minimal latency
O(α log² p + β (n/p) log p + (n/p) log n)
Hypercube Quicksort
  Static communication pattern
  Precomputable communicators
  Bad for skewed inputs
  Only works for p = 2^k
[Figure: PE 1 to PE 4 going through the steps sort locally, splitter selection, split, redistribution, merge, recurse on subcube 1 and subcube 2]
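The cost model and the static communication pattern can be written down directly. The partner rule below (flip one bit of the rank per recursion level, highest dimension first) is the usual textbook formulation of hypercube algorithms and is not taken from the slides; `message_time`, `hypercube_partner` and `is_power_of_two` are hypothetical helper names.

```cpp
#include <cassert>

// Time to send l machine words in the single-ported model: alpha + beta * l.
double message_time(double alpha, double beta, double l) {
  return alpha + beta * l;
}

// Assumed pattern: on a hypercube with 2^k PEs, the PE `rank` exchanges data
// in recursion level `level` (0-based) with the PE whose rank differs in bit
// k - 1 - level. The pattern depends only on rank, k and level, so it is
// static and the communicators can be precomputed.
int hypercube_partner(int rank, int k, int level) {
  return rank ^ (1 << (k - 1 - level));
}

// The pattern only exists for p = 2^k.
bool is_power_of_two(int p) {
  return p > 0 && (p & (p - 1)) == 0;
}
```

For p = 4 (k = 2), PE 0 exchanges with PE 2 in level 0 and with PE 1 in level 1. A PE count such as 6 has no such pairing, which is the restriction that the slide lists last.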

19 Janus Sort
O(α log² p + β (n/p) log p + (n/p) log n)
Arbitrary p
Calculate communicators on the fly
Perfectly balanced data
Roman god Janus: two-faced
[Figure: PE 0 to PE 5 after the local sort and after one recursion step, divided into group 1 and group 2]
Image taken by Fubar Obfusco
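One way to read "perfectly balanced data" together with "calculate communicators on the fly" is sketched below. It assumes that every PE keeps the same number of elements, that the small elements fill the PEs from the left, and that a PE receiving both small and large elements joins both groups as the Janus PE. The slides do not spell this rule out, so `Groups` and `janus_groups` are hypothetical and the rule itself is an assumption.

```cpp
#include <cassert>

// PE ranges (inclusive) of the two groups of the next recursion level.
struct Groups {
  int left_first;
  int left_last;
  int right_first;
  int right_last;
  bool has_janus;  // true if left_last == right_first is a shared Janus PE
};

// small_total: number of elements not larger than the pivot in the group.
// per_pe: elements per PE (perfect balance), p: number of PEs in the group.
// Assumes 0 < small_total < per_pe * p.
Groups janus_groups(long long small_total, long long per_pe, int p) {
  Groups g;
  g.has_janus = (small_total % per_pe) != 0;
  g.left_first = 0;
  // Last PE that holds at least one small element.
  g.left_last = static_cast<int>((small_total + per_pe - 1) / per_pe) - 1;
  // First PE that holds at least one large element.
  g.right_first = static_cast<int>(small_total / per_pe);
  g.right_last = p - 1;
  return g;
}
```

With 6 PEs holding 4 elements each and 10 small elements, PE 2 receives two small and two large elements and ends up in both groups. Both groups are contiguous rank ranges, so each of them fits a range-based communicator.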

20 Janus Sort
[Figure: execution on group j and group j + 1 with PE i, a Janus PE, PE i + 2, PE i + 3 and PE i + 4; steps: partitioning, data exchange, comm creation, base case; the resulting groups k and k + 1 contain new Janus PEs]
Pivot selection: O(α log p)
  Regular PE: IBcast, Wait
  Janus PE: IBcast on the left group and on the right group, then Waitall
Partitioning: O(log(n/p))
  Regular PE: binary search
  Janus PE: binary search for the left group and for the right group
Data exchange: O(α + β n/p)
  Regular PE: IScan, Wait, IBcast, Wait, ISend, IRecv, Waitall
  Janus PE: IScan, IBcast, ISend and IRecv on the left group and on the right group, each followed by Waitall
Comm creation: O(1)
  Regular PE: create comm
  New Janus PE: creation of even subcommunicator and creation of odd subcommunicator
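The logarithmic partitioning cost follows from the data being sorted locally, so the pivot position can be found by binary search instead of a scan. A minimal sketch, assuming the local data is an ascending `std::vector<double>` and that elements equal to the pivot count as small; `split_position` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the number of local elements that are <= pivot. The data must be
// sorted in ascending order; std::upper_bound then needs only a logarithmic
// number of comparisons.
std::size_t split_position(const std::vector<double>& sorted, double pivot) {
  return static_cast<std::size_t>(
      std::upper_bound(sorted.begin(), sorted.end(), pivot) - sorted.begin());
}
```

For the local data {1, 2, 3, 5, 8} and pivot 3 the split position is 3: the first three elements go to the left group and the remaining two to the right group. A Janus PE would run the search once per group it belongs to, each time with that group's pivot.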

29 Janus Sort: Experimental Results
Sorting of double values
Weak scaling on the SuperMUC
[Plots: run time [s] over elements per core and over cores, on Juqueen and SuperMUC with IBM MPI; series: RBC Split, MPI Split]

30 Conclusion
Communicator creation is expensive
Blocking communicator creation breaks the idea of nonblocking communication
Extension for MPI
Range-based communicators
Case study on sorting
Code published at
