Tutorial: Application MPI Task Placement


1 Tutorial: Application MPI Task Placement. Juan Galvez, Nikhil Jain, Palash Sharma. PPL, University of Illinois at Urbana-Champaign

2 Tutorial Outline: Why task mapping on Blue Waters? When to do mapping? How to do mapping? Example: MILC (near-neighbor comm pattern, >2x speedup). Example: Qbox (subgroup Alltoalls, 1.57x speedup observed). Focus on the automatic tool.

3 Topology-Aware Task Placement What is topo-aware task placement? Place tasks so that communicating tasks are closer in the machine topology: minimize off-node traffic and link contention Not the same as simple rank reordering It takes into account the actual nodes in the job and distance in the network between MPI tasks Why topo-aware task placement? Execution time can improve significantly (> 2x) in large jobs in some applications compared to default BW mapping

4 Mapping example: 2D application grid to 2D torus. 2D application grid, ranks ordered by row. Near-neighbor communication. Each cell is one MPI task.

5 Mapping example: 2D application grid to 2D torus. If the torus has the same shape as the application grid, preserve the grid structure.

6 Mapping example: Random Placement Bad placement Longer paths Higher link utilization Link contention -> Longer communication time

7 2D 10x10 virtual grid, 2D 10x10 torus. Near-neighbor communication: each task sends unit load to each neighbor; a random shortest path is used for each pair of communicating ranks. Results table (values not preserved here) compares row order vs. random placement by avg hops, max hops, avg link load, and max link load.
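The following minimal sketch shows how the hop metrics on this slide can be reproduced for row-order versus random placement. It assumes a 10x10 torus, unit loads, and plain torus hop counts rather than the slide's random shortest-path routing; link-load accounting would follow the same pattern by walking a chosen route for each pair.

import random

P = 10  # torus (and application grid) is P x P

def torus_dist(a, b):
    """Hop count between torus coordinates a and b (wrap-around in x and y)."""
    dx = abs(a[0] - b[0]); dy = abs(a[1] - b[1])
    return min(dx, P - dx) + min(dy, P - dy)

def neighbors(rank):
    """4 near-neighbor partners of a rank in the P x P application grid."""
    x, y = rank % P, rank // P
    return [((x + 1) % P) + y * P, ((x - 1) % P) + y * P,
            x + ((y + 1) % P) * P, x + ((y - 1) % P) * P]

def evaluate(placement):
    """placement[rank] -> torus coordinate; return (avg hops, max hops)."""
    hops = [torus_dist(placement[r], placement[n])
            for r in range(P * P) for n in neighbors(r)]
    return sum(hops) / len(hops), max(hops)

coords = [(i % P, i // P) for i in range(P * P)]
row_order = {r: coords[r] for r in range(P * P)}        # rank r on node r
shuffled = random.sample(coords, len(coords))
random_place = {r: shuffled[r] for r in range(P * P)}   # random permutation

print("row order :", evaluate(row_order))      # expect avg = max = 1 hop
print("random    :", evaluate(random_place))   # expect several hops on average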

8 2D 10x10 virtual grid, 2D 5x20 torus. Results table (values not preserved here) compares row order, random placement, and a (near-)optimal placement by avg hops, max hops, avg link load, and max link load.

9 Default BW mapping: tasks 0..N-1 are assigned in rank order to the node list N0, N1, N2, ... There is no real guarantee on the quality of this map; communicating tasks may not be close to each other. It can work well on some shapes and badly on others. A better mapping is recommended to obtain consistent performance regardless of shape, OR always use a shape that gives good performance.

10 How important is task placement for my application? It depends on: application comm patterns (near-neighbor, collectives), volume of communication, computation-to-communication ratio, allocation shape, and job size (larger jobs have more communication, and potentially larger distances between tasks). Some applications (e.g. MILC) can get 2x-3x or more speedup in execution time on large jobs.

11 When to do mapping? Perhaps the easiest solution is to use an automatic tool and test whether there are actual improvements; worst case: no improvement. Remember that it might only improve for certain job sizes and job shapes. Analyze the communication behavior of the application. Knowledge of the application helps: computation-to-communication ratio, volume of point-to-point traffic, collectives pattern. If the user doesn't have this knowledge, use profiling tools.

12 Profiling Tools. CrayPat: large suite of performance analysis tools, including tracing. mpiP: link-time library, no need to recompile the application; presents statistical information on MPI calls. TopoProfiler: link-time library; collects information on MPI calls; includes topological information (hop-bytes, rank placement in the 3D torus) from the run; extracts the communication graph (messages and bytes communicated between MPI tasks).

13 Using TopoProfiler 1. Build topoprofiler libraries 2. Link with your application 3. After running the application, files generated: Summary file (optional) Comm graph Collectives stats file

14 TopoProfiler Output (sample run; numeric values not preserved in this transcription): Total Execution Time, Total Messages, Total Bytes; Average/Min/Max Messages per rank; Average/Min/Max Bytes per rank; Average/Min/Max Hopbytes per rank (Min on Rank 1672, Max on Rank 3040); Average/Min/Max Idle Time per rank (Min on Rank 4850, Max on Rank 891). Slide annotations: total run time, messages, bytes, total bytes, idle time as a proxy for communication time, and the per-rank hop-bytes definition hopbytes(n) = Σ over peers m of hops(m, n) × bytes(m, n).

15 TopoProfiler Output (continued; numeric values not preserved): Average/Min/Max Point to Point Operation Time per rank (Min on Rank 4177, Max on Rank 6783); Average/Min/Max Collective Operation Time per rank (Min on Rank 2642, Max on Rank 3593); Average/Min/Max number of Send Calls, Recv Calls, and Collective Operations per rank. Slide annotations: time in point-to-point comm, time in collectives.

16 TopoProfiler Output: per-rank table with columns MPI Rank, X, Y, Z, T, Messages, Bytes, Hop-bytes, Idle Time, Send Calls, Recv Calls. Collectives: timing, number of calls, bytes. There is an option to contribute some collectives to the per-rank hop-bytes metric, e.g. to see how close the ranks in a subgroup collective are to each other.
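As an illustration of that option, here is a minimal sketch of folding one subgroup collective into per-rank hop-bytes. It assumes a simple all-pairs traffic model for an Alltoall; this is an assumption for illustration, not necessarily how TopoProfiler attributes collectives.

def add_alltoall_hopbytes(hopbytes, group, bytes_per_pair, coord, dist):
    """hopbytes: dict rank -> accumulated hop-bytes; group: ranks in the
    subgroup collective; coord: rank -> node coordinate; dist: hop count
    between two coordinates. Adds bytes_per_pair * distance for every
    ordered pair of distinct group members."""
    for a in group:
        for b in group:
            if a != b:
                hopbytes[a] = hopbytes.get(a, 0) + bytes_per_pair * dist(coord[a], coord[b])
    return hopbytes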

17 How to do mapping? In the application: hard, though there are ways to access topological info from within the application (e.g. TopoManager). External tools: CrayPat grid_order, Topaware, genm. genm works with any MPI application, requires no source code modification, and works with any communication pattern, including irregular ones.

18 CrayPat grid_order: generates an MPI rank order based on the grid pattern detected by CrayPat profiling tools. Calculated off-line; can place communicating tasks on the same node, but doesn't explicitly place communicating nodes near each other. Doesn't deal with irregular patterns.

19 Topaware: Run on the login node; the user gives the application's virtual topology grid, and Topaware returns the set of geometries on which to run the job. Inside the batch job, call topaware to calculate the map for the given allocation. Limitations: need to know the shape of the application grid (e.g. a 4D application grid partitioned into 8x8x8x16 tasks); the tool assumes a near-neighbor comm pattern in that grid; need to run in specific geometries. Very good performance results with MILC.

20 genm: Communication graph as input. Goal: minimize the average (per-rank) hopbytes, where hopbytes(a) = Σ over peers n of bytes(a, n) × distance(a, n). Keep communicating tasks close, with preference based on the amount of communication. This metric doesn't necessarily correlate perfectly with application execution time, but gives good results. Additionally favors solutions with smaller max hopbytes. Hopbytes in the y direction are given 2x more weight.

21 genm algorithm: place tasks on processors to minimize hopbytes, which is an NP-hard problem. Heuristic algorithm with a random component and a greedy component; multiple trials of random greedy search. Other algorithms were not used because they are either too slow or find worse solutions. A separate instance of the algorithm runs on every processor.
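The slide describes the search only at a high level; below is a minimal sketch, not the actual genm code, of a random-plus-greedy placement heuristic of the kind described: each trial seeds the map with a random task on a random node, then greedily places the heaviest remaining communicator on the free node that adds the least hop-bytes, and the best of several trials is kept.

import random

def hops(a, b, dims):
    """Torus hop count between node coordinates a and b."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def greedy_map(comm, nodes, dims, trials=10):
    """comm[i][j] = bytes between tasks i and j; nodes = free node coordinates."""
    ntasks, best, best_cost = len(comm), None, float("inf")
    for _ in range(trials):
        free, placed = list(nodes), {}
        start = random.randrange(ntasks)                  # random component
        placed[start] = free.pop(random.randrange(len(free)))
        while len(placed) < ntasks:
            # greedy component: pick the heaviest communicator with the placed set
            t = max((i for i in range(ntasks) if i not in placed),
                    key=lambda i: sum(comm[i][j] for j in placed))
            # put it on the free node that adds the least hop-bytes
            k = min(range(len(free)),
                    key=lambda n: sum(comm[t][j] * hops(free[n], placed[j], dims)
                                      for j in placed))
            placed[t] = free.pop(k)
        cost = sum(comm[i][j] * hops(placed[i], placed[j], dims)
                   for i in placed for j in placed) / ntasks
        if cost < best_cost:
            best, best_cost = placed, cost
    return best, best_cost

# Example: 4 tasks in a ring placed on a 1-D torus of length 4
ring = [[0, 10, 0, 10], [10, 0, 10, 0], [0, 10, 0, 10], [10, 0, 10, 0]]
print(greedy_map(ring, nodes=[(i,) for i in range(4)], dims=(4,)))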

22 Running genm. Inside the batch job:
aprun -n NUM_PROCESSORS ./mapper comm_graph.txt
export MPICH_RANK_REORDER_METHOD=3
aprun -n NUM_PROCESSORS ./program args
Runs on any geometry and arbitrary topologies; works with BW service nodes ("holes"). The mapper runs on every processor, for two reasons: it increases the search space (subsets of processors search different solution spaces), and with large problems it compensates for the fact that a single processor can do fewer trials.
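A minimal sketch of how a computed order gets consumed by the runtime: with MPICH_RANK_REORDER_METHOD=3, cray-mpich reads a custom rank order from a file named MPICH_RANK_ORDER in the working directory, written as a comma-separated list of ranks; my understanding (check the intro_mpi man page for the exact semantics) is that the i-th rank listed is placed on the i-th slot of the node/core list. Whether genm's mapper writes this file itself is not stated on the slide, so this helper is an assumed post-processing step.

def write_rank_order(new_order, path="MPICH_RANK_ORDER"):
    """new_order[i] = original MPI rank to place at position i of the node list."""
    with open(path, "w") as f:
        f.write(",".join(str(r) for r in new_order) + "\n")

write_rank_order([0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15])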

23 Mapping Examples: Methodology Profile: Analyze/confirm communication pattern Detect communication issues, e.g. high hopbytes, high point-to-point time, high collectives time Collectives used and timings of each Apply genm automatic task mapping

24 Example: MILC MILC - Lattice QCD 4D lattice (3 spatial dimensions + time) Extremely sensitive to placement Near-neighbor communication pattern

25 MILC Profiling Results (task count not preserved; 8x8x8 shape, default BW mapping; numeric values not preserved): Total Execution Time; Average/Min/Max Hopbytes per rank; Average/Min/Max Idle Time per rank (~40% of time spent on communication); Average/Min/Max Point to Point Operation Time per rank; Average/Min/Max Collective Operation Time per rank.

26 MILC Profiling Results (task count not preserved; 11x2x24 shape, default BW mapping; numeric values not preserved): Total Execution Time; Average/Min/Max Hopbytes per rank; Average/Min/Max Idle Time per rank (~63% of time spent on communication); Average/Min/Max Point to Point Operation Time per rank; Average/Min/Max Collective Operation Time per rank.

27 MILC mapping results on BW: bar chart of program execution time (s) for MILC with 8192 ranks, comparing topaware, genm, and the default mapping across allocation shapes (X * Y * Z) 8x4x8, 11x2x12, and 6x6x8.

28 MILC mapping results on BW: bar chart of program execution time (s) for MILC (rank count not preserved), comparing topaware, genm, and the default mapping across allocation shapes (X * Y * Z) 8x8x8, 11x6x8, and 11x2x24.

29 Example: Qbox. First-principles MD code developed at LLNL; computes the properties of materials directly from the underlying physics equations (Density Functional Theory). The computational workload is split between parallel dense linear algebra (ScaLAPACK library) and a custom 3D Fast Fourier Transform. Gordon Bell Prize 2006 for scaling on BG/L.

30 Qbox communication (diagram): 2D process grid, MPI_Alltoall, MPI_Allreduce.

31 Qbox Profiling Results (numeric values not preserved): Total Execution Time; Average/Min/Max Hopbytes per rank; Average/Min/Max Idle Time per rank (~35% of execution time); Average/Min/Max Point to Point Operation Time per rank; Average/Min/Max Collective Operation Time per rank.

32 Example: Qbox Profiling results. Most of the communication time is spent in Alltoall; these are subgroup Alltoalls (not a global Alltoall). Example: 84 MPI groups of size 256 spend > 20 seconds on collectives (avg = 28, max = 38); the time spent by these groups in MPI_Alltoallv is avg = 24, max = 26. Mapping strategy: cluster these subgroups together, OR let the automatic mapping do it.
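A minimal sketch of the "cluster these subgroups together" strategy, assuming the subgroup membership is known; here it is illustrated with the column groups of a small 2D process grid, which may differ from Qbox's actual grouping. The resulting order can be fed to the rank-reorder mechanism shown on slide 22.

def cluster_groups(groups, nranks):
    """Return new_order where new_order[i] is the original rank placed at slot i."""
    new_order, seen = [], set()
    for g in groups:
        for r in g:
            if r not in seen:       # a rank can belong to several groups;
                seen.add(r)         # keep only its first placement
                new_order.append(r)
    new_order += [r for r in range(nranks) if r not in seen]   # any leftovers
    return new_order

# Example: 16 ranks in a 4x4 process grid (row-major); groups = grid columns
groups = [[row * 4 + col for row in range(4)] for col in range(4)]
print(cluster_groups(groups, 16))   # -> [0, 4, 8, 12, 1, 5, 9, 13, ...]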

33 Qbox Mapping Results on BW: bar charts of program execution time (s) comparing genm vs. the default mapping, for Qbox with 8192 tasks and with a larger task count (value not preserved), across allocation shapes (X * Y * Z) 11x3x8, 11x2x12, and 11x2x24.

34 Thank you! Contact me:
