MPI Profile (mpiP) on IRS Benchmark Application

Alannah Young


2 MPI with ZRAD8

[Table: App% and MPI% per MPI call ("any", Isend, "all", Irecv) for -np=2, -np=6, -np=1]
[Figure: % of App Time on Each MPI Call Using MPI and zrad.8 (top 4), for -np=2, -np=6, -np=1]
[Figure: % of MPI Time on Each MPI Call Using MPI and zrad.8 (top 4), for -np=2, -np=6, -np=1]

* "any", "allreduce" and "" take almost 3% of application time when blocks are distributed unevenly among nodes. The "-np=1" case is almost the same as "", because the number of blocks is smaller than the number of nodes.

* zrad.8: This is an 8-block problem, suitable for testing with at least 2 MPI processes and up to 8 CPUs. The following types of runs should work well:
  8 MPI processes
  4 MPI processes at 2 threads each
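The effect of uneven block distribution can be illustrated with a minimal sketch. It assumes blocks are split as evenly as possible across ranks, which is an assumption for illustration and not necessarily how IRS decomposes its domain; the function names and the process count of 12 are hypothetical.

```python
def blocks_per_rank(num_blocks, num_ranks):
    """Split num_blocks across num_ranks as evenly as possible (assumed policy)."""
    base, extra = divmod(num_blocks, num_ranks)
    # The first `extra` ranks receive one additional block.
    return [base + 1 if rank < extra else base for rank in range(num_ranks)]


def imbalance(counts):
    """Ratio of the heaviest rank's load to the mean load (1.0 = balanced)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean


if __name__ == "__main__":
    # 8 blocks as in zrad.8; 12 is a hypothetical count larger than 8.
    for num_ranks in (2, 6, 12):
        counts = blocks_per_rank(8, num_ranks)
        idle = counts.count(0)
        print(f"np={num_ranks}: blocks per rank={counts}, "
              f"imbalance={imbalance(counts):.2f}, idle ranks={idle}")
```

With 2 ranks every rank holds 4 blocks. With 6 ranks two ranks hold 2 blocks while the rest hold 1, so the lighter ranks wait in collective and wait calls. With more ranks than blocks, some ranks hold no blocks at all.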

3 MPI with ZRAD64

[Table: App% and MPI% per MPI call ("any", Isend, "all", Irecv) for -np=2, -np=6, -np=1]
[Figure: % of App Time on Each MPI Call Using MPI and zrad.64 (top 4), for -np=2, -np=6, -np=1]
[Figure: % of MPI Time on Each MPI Call Using MPI and zrad.64 (top 4), for -np=2, -np=6, -np=1]

* "any", "allreduce" and "" take almost % of application time when the number of blocks is much larger than the number of nodes. It is even worse when blocks are unevenly distributed among nodes, as in the cases "-np=6" and "-np=1".

* zrad.64: This is a 64-block problem, suitable for testing with at least 2 MPI processes and up to 64 CPUs. The following types of runs should work well:
  64 MPI processes
  32 MPI processes at 2 threads each
  16 MPI processes at 4 threads each
  8 MPI processes at 8 threads each
  4 MPI processes at 16 threads each
  2 MPI processes at 32 threads each
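The run types listed for zrad.64 all satisfy processes × threads = 64. A short sketch that enumerates such combinations, assuming that rule (inferred from the list, not stated by the benchmark) and a hypothetical helper name `hybrid_configs`:

```python
def hybrid_configs(num_blocks, min_procs=2):
    """List (MPI processes, threads per process) pairs whose product is num_blocks.

    Assumes one block per CPU and at least `min_procs` MPI processes.
    """
    configs = []
    for procs in range(num_blocks, min_procs - 1, -1):
        if num_blocks % procs == 0:
            configs.append((procs, num_blocks // procs))
    return configs


if __name__ == "__main__":
    for procs, threads in hybrid_configs(64):
        print(f"{procs} MPI processes at {threads} thread(s) each")
```

For 64 blocks this reproduces the six run types listed above, from 64 processes at 1 thread each down to 2 processes at 32 threads each.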

4 MPI/OpenMP 2 Threads with ZRAD8

[Table: App% and MPI% per MPI call ("any", Isend, "all", Irecv) for -np=2, -np=6]
[Figure: % of App Time on Each MPI Call Using MPI+OpenMP 2 Threads and zrad.8 (top 4), for -np=2, -np=6]
[Figure: % of MPI Time on Each MPI Call Using MPI+OpenMP 2 Threads and zrad.8 (top 4), for -np=2, -np=6]

* MPI/OpenMP significantly reduces MPI time when multiple blocks are distributed to each node. Communication among processes within a node is faster than communication between processes on different nodes.
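When comparing MPI time across configurations, the App% and MPI% columns can be read as two ratios: a call's time relative to total application time, and relative to total MPI time. The sketch below shows that arithmetic. The timing values are invented purely for illustration and are not measurements from these runs, and it assumes the listed calls account for all MPI time.

```python
def call_percentages(call_times, app_time):
    """Return {call: (app_pct, mpi_pct)} from per-call times.

    call_times: seconds spent in each MPI call (hypothetical values below).
    app_time:   total application time in seconds.
    Assumes the calls in call_times make up all of the MPI time.
    """
    mpi_time = sum(call_times.values())
    return {
        call: (100.0 * t / app_time, 100.0 * t / mpi_time)
        for call, t in call_times.items()
    }


if __name__ == "__main__":
    # Hypothetical timings, chosen only to make the arithmetic easy to follow.
    times = {"Waitany": 4.0, "Allreduce": 2.0, "Isend": 1.0, "Irecv": 1.0}
    for call, (app_pct, mpi_pct) in call_percentages(times, app_time=40.0).items():
        print(f"{call:<10} App%={app_pct:5.1f}  MPI%={mpi_pct:5.1f}")
```

A call can dominate MPI% while remaining a small share of App%, so both columns are needed to judge its impact on the whole application.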

5 MPI/OpenMP 4 Threads with ZRAD8

[Table: App% and MPI% per MPI call ("any", Isend, "all", Irecv) for -np=2, -np=6]
[Figure: % of App Time Spent on MPI Calls Using MPI+OpenMP 4 Threads and zrad.8 (top 4), for -np=2, -np=6]
[Figure: % of MPI Time on Each MPI Call Using MPI+OpenMP 4 Threads and zrad.8 (top 4), for -np=2, -np=6]

* Compared to MPI/OpenMP with 2 threads, the MPI calls in MPI/OpenMP with 4 threads do not show any major change. Increasing the number of OpenMP threads does not lead to any change in MPI time.

6 MPI/OpenMP 2 Threads with ZRAD64

[Table: App% and MPI% per MPI call ("any", Isend, "all", Irecv)]
[Figure: % of App Time on Each MPI Call Using MPI+OpenMP 2 Threads and zrad.64 (top 4)]
[Figure: % of MPI Time on Each MPI Call Using MPI+OpenMP 2 Threads and zrad.64 (top 4)]

* When the input size increases, the number of nodes likely needs to increase as well. In the case of "np=2", "any", "" and "" take longer than "any" in the case of "". When the number of nodes is small, there is likely a lot of communication between nodes, along with problems of call blocking and synchronization.

7 MPI/OpenMP 4 Threads with ZRAD64

[Table: App% and MPI% per MPI call ("any", Isend, "all", Irecv)]
[Figure: % of App Time on Each MPI Call Using MPI+OpenMP 4 Threads and zrad.64 (top 4)]
[Figure: % of MPI Time on Each MPI Call Using MPI+OpenMP 4 Threads and zrad.64 (top 4)]

* Compared to MPI/OpenMP with 2 threads, the MPI calls in MPI/OpenMP with 4 threads do not show any major change. Increasing the number of OpenMP threads does not lead to any change in MPI time.

8 CONCLUSION

"any", "allreduce" and "" are the three major MPI calls that may have an impact on a large-scale application like IRS when it runs in a cluster environment. A few factors may reduce the impact of MPI calls on application time:

* Number of nodes: When the size of the data increases, the number of nodes should increase.
* Distributing data: Minimizing data dependency on each node helps minimize data exchange between nodes. This task also depends on the number of
* Number of threads: This does not directly improve MPI calls, but it helps to determine how to distribute data among nodes in an SMP environment.
