Improving the interoperability between MPI and OmpSs-2

1 Improving the interoperability between MPI and OmpSs-2
Vicenç Beltran Querol
19/04/2018, INTERTWinE Exascale Application Workshop, Edinburgh

2 Outline
- Why hybrid MPI+OmpSs-2 programming?
- Gauss-Seidel method
  - Pure MPI
  - MPI + OmpSs (fork-join)
  - MPI + OmpSs (tasks + sentinel)
- OmpSs-2 pause/resume API
- Task-Aware MPI (TAMPI) library
- Gauss-Seidel method
  - MPI + OmpSs (tasks + TAMPI)
- Evaluation
- Conclusions

3 Why hybrid MPI+OmpSs-2 programming?
Try to leverage the best of both programming models...
- Message Passing Interface (MPI)
  - Designed to exploit distributed-memory systems
  - Efficient and scalable message-passing interface
- OmpSs-2 tasking model
  - Designed to exploit shared-memory systems
  - Write sequential code, but execute it in parallel
  - Fine-grained synchronization
  - Automatic load balancing
...but also to exploit some potential synergies
- Fine-grained synchronization across nodes
- Overlap of computation and communication phases
- Leverage intra-node application parallelism to hide network latency and maximize network throughput
However, interoperability issues between MPI and OmpSs-2 prevent application developers from achieving most of these goals.

4 Gauss-Seidel method: Sequential
- In-place iterative algorithm
- Ex: 3 x 3 tile domain
- Each tile depends on the top and left tiles from the current iteration and on the right and bottom tiles from the previous iteration
[Figure: 3 x 3 tile domain labeled with iterations i and i-1; each tile is a task that computes a block on the i-th iteration]
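The in-place dependency pattern on this slide can be illustrated at the level of single cells rather than tiles. The following is a minimal sketch, not code from the presentation: the names `gaussSeidelSweep` and `makeGrid`, the row-major `std::vector` layout, and the fixed boundary are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One in-place Gauss-Seidel sweep over the interior of an n x n grid stored
// row-major. Boundary cells are left untouched.
// Returns the largest absolute change applied to any cell.
double gaussSeidelSweep(std::vector<double> &grid, int n) {
    assert(n >= 3 && grid.size() == static_cast<std::size_t>(n) * n);
    double maxChange = 0.0;
    for (int x = 1; x < n - 1; ++x) {
        for (int y = 1; y < n - 1; ++y) {
            // Top and left neighbors were already updated in this sweep;
            // bottom and right neighbors still hold the previous sweep's values.
            double updated = 0.25 * (grid[(x - 1) * n + y] + grid[(x + 1) * n + y] +
                                     grid[x * n + (y - 1)] + grid[x * n + (y + 1)]);
            maxChange = std::max(maxChange, std::fabs(updated - grid[x * n + y]));
            grid[x * n + y] = updated;
        }
    }
    return maxChange;
}

// Hypothetical helper: n x n grid with every boundary cell set to
// boundaryValue and every interior cell set to 0.
std::vector<double> makeGrid(int n, double boundaryValue) {
    std::vector<double> grid(static_cast<std::size_t>(n) * n, 0.0);
    for (int i = 0; i < n; ++i) {
        grid[i] = boundaryValue;                // top row
        grid[(n - 1) * n + i] = boundaryValue;  // bottom row
        grid[i * n] = boundaryValue;            // left column
        grid[i * n + (n - 1)] = boundaryValue;  // right column
    }
    return grid;
}
```

With a 6 x 6 grid and boundary value 1.0, the first sweep sets cell (1,1) to 0.5 and cell (1,2) to 0.375, because cell (1,2) already sees the updated left neighbor. A Jacobi sweep, which reads only values from the previous iteration, would give 0.25 for cell (1,2).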

5 Gauss-Seidel method: Pure MPI
- Ex: 12 x 3 blocks domain, decomposed across 4 MPI ranks
- After each iteration, neighboring MPI ranks have to exchange halos
[Figure: blocks assigned to Rank 0 to Rank 3; legend: data dependency, MPI communication]

6 Gauss-Seidel method: Pure MPI
    void solve(block_t *matrix, int rowblocks, int colblocks, int timesteps)
    {
        int rank, rank_size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &rank_size);

        for (int t = 0; t < timesteps; ++t) {
            solvegaussseidel(matrix, rowblocks, colblocks, rank, rank_size);
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }

7 Gauss-Seidel method: Pure MPI
    void solvegaussseidel(block_t *matrix, int nbx, int nby, int rank, int rank_size)
    {
        // Exchange halos with the upper neighbor (rank 0 has none)
        if (rank != 0) {
            sendfirstcomputerow(matrix, nbx, nby, rank, rank_size);
            receiveupperborder(matrix, nbx, nby, rank, rank_size);
        }
        // Receive the halo from the lower neighbor (the last rank has none)
        if (rank != rank_size - 1) {
            receivelowerborder(matrix, nbx, nby, rank, rank_size);
        }
        // Compute all interior blocks
        for (int bx = 1; bx < nbx-1; ++bx) {
            for (int by = 1; by < nby-1; ++by) {
                solveblock(matrix, nbx, nby, bx, by);
            }
        }
        if (rank != rank_size - 1) {
            sendlastcomputerow(matrix, nbx, nby, rank, rank_size);
        }
    }

    void sendlastcomputerow(block_t *matrix, int nbx, int nby, int rank, int rank_size)
    {
        for (int by = 1; by < nby-1; ++by) {
            MPI_Send(&matrix[(nbx-2)*nby + by][BSX-1], BSY, MPI_DOUBLE, rank + 1, by, MPI_COMM_WORLD);
        }
    }

    void receiveupperborder(block_t *matrix, int nbx, int nby, int rank, int rank_size)
    {
        for (int by = 1; by < nby-1; ++by) {
            MPI_Recv(&matrix[by][BSX-1], BSY, MPI_DOUBLE, rank - 1, by, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

8 Gauss-Seidel method: Pure MPI
    void solveblock(block_t *matrix, int nbx, int nby, int bx, int by)
    {
        block_t &targetblock = matrix[bx*nby + by];
        const block_t &centerblock = matrix[bx*nby + by];
        const block_t &topblock = matrix[(bx-1)*nby + by];
        const block_t &leftblock = matrix[bx*nby + (by-1)];
        const block_t &rightblock = matrix[bx*nby + (by+1)];
        const block_t &bottomblock = matrix[(bx+1)*nby + by];

        for (int x = 0; x < BSX; ++x) {
            // Rows on the block edge come from the neighboring blocks
            const row_t &toprow = (x > 0) ? centerblock[x-1] : topblock[BSX-1];
            const row_t &bottomrow = (x < BSX-1) ? centerblock[x+1] : bottomblock[0];

            for (int y = 0; y < BSY; ++y) {
                double left = (y > 0) ? centerblock[x][y-1] : leftblock[x][BSY-1];
                double right = (y < BSY-1) ? centerblock[x][y+1] : rightblock[x][0];
                targetblock[x][y] = 0.25 * (toprow[y] + bottomrow[y] + left + right);
            }
        }
    }

9 Gauss-Seidel method: Fork-Join
- OmpSs-2 used to execute the computational phase of the program in parallel
- MPI used only in the sequential phases of the program, for communication
- No overlapping of communication and computation phases
- 1 rank per node!
    void solvegaussseidel(block_t *matrix, int nbx, int nby, int rank, int rank_size)
    {
        if (rank != 0) {
            sendfirstcomputerow(matrix, nbx, nby, rank, rank_size);
            receiveupperborder(matrix, nbx, nby, rank, rank_size);
        }
        if (rank != rank_size - 1) {
            receivelowerborder(matrix, nbx, nby, rank, rank_size);
        }
        for (int bx = 1; bx < nbx-1; ++bx) {
            for (int by = 1; by < nby-1; ++by) {
                #pragma oss task \
                        in(([nbx][nby]matrix)[bx-1][by]) \
                        in(([nbx][nby]matrix)[bx][by-1]) \
                        in(([nbx][nby]matrix)[bx][by+1]) \
                        in(([nbx][nby]matrix)[bx+1][by]) \
                        inout(([nbx][nby]matrix)[bx][by])
                solveblock(matrix, nbx, nby, bx, by);
            }
        }
        // Join: wait for all compute tasks before communicating
        #pragma oss taskwait
        if (rank != rank_size - 1) {
            sendlastcomputerow(matrix, nbx, nby, rank, rank_size);
        }
    }

10 Gauss-Seidel method: Tasks + sentinel
- Tasks used for both computations and communications
- Tags used to match send and receive operations, but...
- Communication tasks have to be serialized to avoid deadlocks
- Partial overlapping of communication and computation phases
- 1 rank per node!
    void solvegaussseidel(block_t *matrix, int nbx, int nby, int rank, int rank_size)
    {
        if (rank != 0) {
            sendfirstcomputerow(matrix, nbx, nby, rank, rank_size);
            receiveupperborder(matrix, nbx, nby, rank, rank_size);
        }
        if (rank != rank_size - 1) {
            receivelowerborder(matrix, nbx, nby, rank, rank_size);
        }
        for (int bx = 1; bx < nbx-1; ++bx) {
            for (int by = 1; by < nby-1; ++by) {
                #pragma oss task \
                        in(([nbx][nby]matrix)[bx-1][by]) \
                        in(([nbx][nby]matrix)[bx][by-1]) \
                        in(([nbx][nby]matrix)[bx][by+1]) \
                        in(([nbx][nby]matrix)[bx+1][by]) \
                        inout(([nbx][nby]matrix)[bx][by])
                solveblock(matrix, nbx, nby, bx, by);
            }
        }
        // No taskwait: communication tasks are ordered through data dependencies
        if (rank != rank_size - 1) {
            sendlastcomputerow(matrix, nbx, nby, rank, rank_size);
        }
    }

11 Gauss-Seidel method: Tasks + sentinel
- Tasks used for both computations and communications
- Tags used to match send and receive operations, but...
- Communication tasks have to be serialized to avoid deadlocks
- Partial overlapping of communication and computation phases
    void sendlastcomputerow(block_t *matrix, int nbx, int nby, int rank, int rank_size)
    {
        for (int by = 1; by < nby-1; ++by) {
            // inout(*serial) is the sentinel that serializes all communication tasks
            #pragma oss task in(([nbx][nby]matrix)[nbx-2][by]) inout(*serial)
            MPI_Send(&matrix[(nbx-2)*nby + by][BSX-1], BSY, MPI_DOUBLE, rank + 1, by, MPI_COMM_WORLD);
        }
    }

    void receiveupperborder(block_t *matrix, int nbx, int nby, int rank, int rank_size)
    {
        for (int by = 1; by < nby-1; ++by) {
            #pragma oss task out(([nbx][nby]matrix)[0][by]) inout(*serial)
            MPI_Recv(&matrix[by][BSX-1], BSY, MPI_DOUBLE, rank - 1, by, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

12 Gauss-Seidel method: Tasks + sentinel
- Why do all communication tasks need to be serialized?
- Because tasks can be scheduled out of order, MPI operations can also be executed out of order...
- All CPUs can stall executing MPI receive operations that depend on the eventual completion of MPI send operations of this rank, but these will never be executed!
[Figure: task state diagram (Create, BLOCKED, Unblock, READY, Schedule, RUNNING, Taskwait, Complete, FINISHED) with the blocked, ready, and running task queues on four CPUs; labels: MPI_Send(), MPI_receive()]
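The deadlock argument on this slide can be checked with a small scheduling model. This is a sketch under strong assumptions and involves no real MPI: two ranks run the same list of communication tasks in lockstep, sends complete immediately, and a blocking receive holds its worker until the matching send has run. The names `CommTask` and `runsToCompletion` are hypothetical. The `pauseOnBlock` flag models a runtime in which a waiting receive gives its worker back, which is the behavior a pause/resume mechanism aims for.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

enum class Op { Send, Recv };
struct CommTask { Op op; int tag; };

// Model: two symmetric ranks, each with `workers` worker threads, start the
// tasks in `order` one after another. Send(tag) on one rank matches Recv(tag)
// on the other. Because both ranks behave identically, the peer has sent a
// tag exactly when this rank has, so one rank's state is enough.
// Returns true if every task finishes, false if the ranks deadlock.
bool runsToCompletion(const std::vector<CommTask> &order, int workers, bool pauseOnBlock) {
    assert(workers >= 1);
    std::set<int> sent;        // tags whose send has already run
    std::vector<int> waiting;  // tags of started receives that are unmatched
    std::size_t next = 0;      // next task to start
    int busy = 0;              // workers held by blocked receives
    while (true) {
        bool progress = false;
        // Finish receives whose matching send has run in the meantime.
        for (auto it = waiting.begin(); it != waiting.end();) {
            if (sent.count(*it)) {
                it = waiting.erase(it);
                if (!pauseOnBlock) --busy;
                progress = true;
            } else {
                ++it;
            }
        }
        // Start further tasks while a worker is free.
        while (next < order.size() && busy < workers) {
            const CommTask &task = order[next++];
            progress = true;
            if (task.op == Op::Send) {
                sent.insert(task.tag);
            } else if (!sent.count(task.tag)) {
                waiting.push_back(task.tag);
                if (!pauseOnBlock) ++busy;  // a blocking receive keeps its worker
            }
        }
        if (next == order.size() && waiting.empty()) return true;
        if (!progress) return false;  // all workers stuck in receives
    }
}
```

In this model, the order Recv(0), Recv(1), Send(0), Send(1) deadlocks with two workers and blocking receives, because both workers sit in receives and the sends never start. The same order completes once waiting receives release their workers, and a serialized order such as Send(0), Recv(0), Send(1), Recv(1) completes either way.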

13 OmpSs-2: Pause/resume API
- Low-level API to programmatically pause and resume the execution of a task.
    void *nanos_get_current_blocking_context();   // Get task id
    void nanos_block_current_task(void *context); // Block task
    void nanos_unblock_task(void *context);       // Unblock task
[Figure: task state diagram extended with a PAUSED state (Create, BLOCKED, Unblock or Resume, READY, Schedule, RUNNING, Taskwait or Pause, PAUSED, Complete, FINISHED) with the paused and blocked, ready, and running task queues on four CPUs]

14 Task-Aware MPI (TAMPI) library
- Leverage the low-level pause/resume API to improve the interoperability between MPI and OmpSs-2 tasks
- Expose this new feature as a new threading support level in MPI: MPI_TASK_MULTIPLE
- When TAMPI is initialized using the MPI_TASK_MULTIPLE threading model, all blocking operations are intercepted and converted to their nonblocking counterparts.
- Ex: MPI_Recv() executed inside a task.
    int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
                 MPI_Comm comm, MPI_Status *status)
    {
        int err, completed = 0;
        if (Interop::isEnabled()) {
            MPI_Request request;
            err = MPI_Irecv(buf, count, datatype, source, tag, comm, &request);
            MPI_Test(&request, &completed, status);
            if (!completed) {
                // Register a ticket and pause the task until the receive completes
                Ticket ticket(&request, status);
                ticket._waiter = get_current_blocking_context();
                _pendingtickets.add(ticket);
                block_current_task(ticket._waiter);
            }
            return err;
        }
        return PMPI_Recv(buf, count, datatype, source, tag, comm, status);
    }

15 Task-Aware MPI (TAMPI) library
- The TAMPI library registers a polling service with the OmpSs-2 runtime to check the completion of in-flight MPI operations.
- Once an MPI operation has completed, the task waiting for it is put back into the ready queue.
- The polling service is executed periodically by the runtime worker threads.
    void Interop::poll()
    {
        // Simplified: assumes the ticket container tolerates removal during iteration
        for (Ticket &ticket : _pendingtickets) {
            int completed = 0;
            MPI_Test(ticket._request, &completed, ticket._status);
            if (completed) {
                _pendingtickets.remove(ticket);
                unblock_task(ticket._waiter);
            }
        }
    }
[Figure: a task that calls a blocking MPI operation is paused and its MPI request is handed to TAMPI; the polling service tests the pending requests and resumes paused tasks into the ready queue, while the CPUs keep scheduling running tasks]
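The ticket bookkeeping described on this slide can be tried out without MPI or the OmpSs-2 runtime. The following is a single-threaded sketch in which every type is a stand-in invented for the example: `MockRequest` replaces an `MPI_Request`, `MockTask` replaces a task's blocking context, and `MockInterop` is not the TAMPI implementation. It uses a `std::list` with iterator-based erase so that tickets can be removed safely while polling.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// Stand-in for an MPI_Request: `done` is flipped by the caller to mimic the
// completion of the underlying operation.
struct MockRequest { bool done; };

// Stand-in for a task and its blocking context.
struct MockTask { int id; bool paused; };

// A ticket links one pending request to the task waiting for it.
struct Ticket { MockRequest *request; MockTask *waiter; };

class MockInterop {
public:
    // Mirrors the intercepted blocking call: test once and, if the operation
    // has not completed, register a ticket and mark the task as paused.
    void blockingWait(MockRequest &request, MockTask &task) {
        if (request.done) return;
        task.paused = true;
        _pendingTickets.push_back(Ticket{&request, &task});
    }

    // Mirrors the polling service: resume every task whose request has
    // completed. Returns the number of tasks resumed.
    int poll() {
        int resumed = 0;
        for (auto it = _pendingTickets.begin(); it != _pendingTickets.end();) {
            if (it->request->done) {
                it->waiter->paused = false;
                it = _pendingTickets.erase(it);  // erase returns the next ticket
                ++resumed;
            } else {
                ++it;
            }
        }
        return resumed;
    }

    std::size_t pending() const { return _pendingTickets.size(); }

private:
    std::list<Ticket> _pendingTickets;
};
```

A typical sequence is to call `blockingWait` for a request that is still in flight, which pauses the task, then set `done` on the request and call `poll`, which resumes the task and drops its ticket. A request that is already complete never creates a ticket.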

16 Gauss-Seidel method: Tasks + TAMPI
- Tasks used for both computations and communications
- Tags used to match send and receive operations
- Full overlapping of communication and computation phases
    void sendlastcomputerow(block_t *matrix, int nbx, int nby, int rank, int rank_size)
    {
        for (int by = 1; by < nby-1; ++by) {
            // No sentinel: communication tasks are no longer serialized
            #pragma oss task in(([nbx][nby]matrix)[nbx-2][by])
            MPI_Send(&matrix[(nbx-2)*nby + by][BSX-1], BSY, MPI_DOUBLE, rank + 1, by, MPI_COMM_WORLD);
        }
    }

    void receiveupperborder(block_t *matrix, int nbx, int nby, int rank, int rank_size)
    {
        for (int by = 1; by < nby-1; ++by) {
            #pragma oss task out(([nbx][nby]matrix)[0][by])
            MPI_Recv(&matrix[by][BSX-1], BSY, MPI_DOUBLE, rank - 1, by, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }
[Figure: timeline across Rank 0 to Rank 3; legend: data dependency, MPI communication, MPI serialization]

17 Tiled Gauss-Seidel: Results
- BS=1024 & 1000 iterations
- Pure MPI: 48 ranks/node
- Hybrids: 1 rank/node & 48 cores/rank
- Baseline: Pure MPI
- Up to 8x speedup w.r.t. Pure MPI

18 Tiled Gauss-Seidel: Traces
- 4 nodes, 100 iterations, 32K x 32K matrix
- Same time duration
[Figure: execution traces for Pure MPI, MPI + Fork-Join, MPI + Tasks (sentinel), and MPI + Tasks + Interop]

19 Conclusions
TAMPI library benefits
- Provides inter-node fine-grained synchronization across tasks
- Automatic overlap of computation and communication phases
- Load balancing
- Exposes more parallelism (removes barriers and artificial dependencies)
- Does not increase application complexity
[Figure: Fork-join, Task+sentinel, Task+TAMPI]

Extending the Task-Aware MPI (TAMPI) Library to Support Asynchronous MPI primitives

Extending the Task-Aware MPI (TAMPI) Library to Support Asynchronous MPI primitives Kevin Sala, X. Teruel, J. M. Perez, V. Beltran, J. Labarta 24/09/2018 OpenMPCon 2018, Barcelona Overview TAMPI Library