Characterizing Imbalance in Large-Scale Parallel Programs. David Böhme, September 26, 2013
Slide 2: Need for performance analysis tools. The amount of parallelism in supercomputers keeps growing, and efficient resource usage depends on software performance. Jugene (2008): … cores; Juqueen (2013): … cores.
Slide 3: Imbalance limits parallel efficiency. Imbalance leads to wait states at synchronization points, e.g., a late-sender wait state, where a Recv on one process must wait for the matching Send on another. Goal: locate inefficient parallelism and quantify its impact.
Slide 4: Two novel analysis methods. Root-cause analysis: determine the delays that cause wait states. Critical-path analysis: identify the activities that determine program runtime.
Slide 5: Outline. Parallel performance analysis; root-cause analysis (concepts, case study); critical-path analysis (the critical path, the critical-path imbalance indicator, analysis of MPMD programs); implementation (parallel trace replay, scalability evaluation).
Slide 6: Performance analysis tools. Tools help understand performance, but profiling provides limited insight and tracing produces too much data to analyze manually. (Shown: a time-profile display in TAU and an event-trace timeline visualization in Vampir.)
Slide 7: Imbalance analysis pitfalls. Profilers underestimate the impact of imbalance: data aggregation hides dynamic imbalance effects (static vs. dynamic imbalance). An analysis solution needs to retain the performance dynamics, so use automatic trace analysis.
Slide 8: Analysis workflow. Source modules are instrumented by the compiler/linker to produce an instrumented executable; the application run generates local event traces; a parallel analysis then produces a global analysis report that answers: Which problem? Where in the code? Which process?
Slide 9: Outline, resumed at Root-Cause Analysis (concepts, case study).
Slide 10: Delay as root cause of wait states. A delay is additional activity on one process that causes a wait state at a synchronization point. Example: a delay between R1 and S2 on process 1 causes a late-sender wait state on process 2.
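The late-sender situation on this slide can be sketched in a few lines. This is an illustrative example, not Scalasca code: when the receiver enters its Recv before the sender enters the matching Send, the difference is the waiting time that the sender's delay produced.

```python
# Illustrative sketch: classifying a late-sender wait state from
# timestamped trace events (timestamps in seconds).

def late_sender_wait(send_enter, recv_enter):
    """Waiting time on the receiver side.

    send_enter: time the sender entered the matching Send
    recv_enter: time the receiver entered the Recv
    """
    return max(0.0, send_enter - recv_enter)

# Process 1 is delayed and enters Send at t=5.0;
# process 2 enters Recv at t=2.0 and must wait.
wait = late_sender_wait(5.0, 2.0)
print(wait)  # 3.0 seconds of late-sender waiting caused by the delay
```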
Slide 11: Wait-state propagation. Wait states can propagate: a wait state may itself delay subsequent communication and cause further wait states on other processes. The analysis must account for propagation through an extended wait-state classification and by incorporating long-distance effects into the calculation of delay costs.
Slide 12: Wait-state classification. Distinguish propagating wait states, which delay other processes and cause further wait states downstream, from terminal wait states, which do not.
Slide 13: Assigning delay costs. Delay costs represent the amount of waiting time caused by a delay: short-term costs cover wait states caused directly, long-term costs cover wait states caused via propagation.
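The cost split above can be expressed as simple bookkeeping. The function and data layout below are illustrative assumptions, not Scalasca's internal API; they only show that a delay's total cost is the sum of its direct and propagated waiting time.

```python
# Hedged sketch of delay-cost bookkeeping: short-term costs are waiting
# times caused directly by a delay, long-term costs are waiting times
# caused via propagation.

def delay_costs(direct_waits, propagated_waits):
    short_term = sum(direct_waits)
    long_term = sum(propagated_waits)
    return {"short_term": short_term,
            "long_term": long_term,
            "total": short_term + long_term}

# A breakdown where most of the cost stems from propagation,
# as in the CESM sea ice case study on the following slides.
costs = delay_costs(direct_waits=[1.0], propagated_waits=[1.5, 1.5])
print(costs["total"])                        # 4.0
print(costs["long_term"] / costs["total"])   # 0.75, i.e. 75% long-term
```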
Slide 14: Case study: CESM sea ice model. Analysis of imbalance in the CESM sea ice model, with performance data mapped onto the application topology: the distribution of computation versus the distribution of late-sender waiting time. (CICE setup: 2048 processes on Blue Gene/P, 1° dipole grid, cartesian grid decomposition.)
Slide 15: CESM sea ice model: wait-state formation. Propagating versus terminal wait states; distribution of delay costs: 25% short-term, 75% long-term.
Slide 16: Outline, resumed at Critical-Path Analysis.
Slide 17: Critical-path analysis. Use automatic trace analysis to extract the critical path of a parallel program (shown in red on the slide). Performance indicators derived from it show bottlenecks at a single glance.
Slide 18: Critical-path profile and imbalance. From the timeline, the critical-path profile shows wall-clock time consumption per activity, in contrast to a summary profile over the CPU allocation time. The critical-path imbalance indicator finds inefficient parallelism: Imbalance = T_critical − T_avg.
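The indicator on this slide is easy to compute once a critical-path profile exists. A minimal sketch, assuming per-activity data is already extracted from the traces: T_critical is the activity's time on the critical path, T_avg its average time across all processes.

```python
# Sketch of the critical-path imbalance indicator from the slide:
# Imbalance = T_critical - T_avg, computed per activity.

def imbalance_indicator(critical_time, per_process_times):
    t_avg = sum(per_process_times) / len(per_process_times)
    return critical_time - t_avg

# A perfectly balanced activity contributes nothing...
print(imbalance_indicator(10.0, [10.0, 10.0, 10.0, 10.0]))  # 0.0
# ...while an imbalanced one (one overloaded process) shows up clearly,
# even though its maximum wall-clock time looks identical in a profile.
print(imbalance_indicator(10.0, [10.0, 4.0, 4.0, 4.0]))     # 4.5
```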
Slide 19: Example: PEPC. Analysis of the plasma-physics code PEPC using 512 processes on Blue Gene/P. Profile metrics (average and maximum wall-clock time for the application kernels tree_walk and sum_force) underestimate the performance impact of the tree-walk kernel due to dynamic load imbalance; the critical-path profile and critical-path imbalance indicator expose it.
Slide 20: Analysis of MPMD programs. Processes execute different activities, e.g., master-worker schemes or the heterogeneous decomposition in ddcMD. This raises complex imbalance-analysis issues not supported by existing tools: imbalance quantification needs to incorporate the partition sizes, and there are more knobs to tune. (Figure: task layout on the Blue Gene/P torus; cyan arrows indicate communication from MD tasks to collector tasks, blue arrows communication from collector tasks to FFT tasks. Image from Richards et al.: Beyond Homogeneous Decomposition, SC'10.)
Slide 21: Performance impact indicators. They denote the allocation-time costs of imbalance: waiting time is mapped onto the critical-path activities with excess load, distinguishing intra-partition and inter-partition imbalance costs (one case may show high intra-partition and low inter-partition costs, another very high inter-partition and low intra-partition costs).
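A rough sketch of the intra/inter distinction, under the simplifying assumption that imbalance is measured directly from per-process load rather than via the mapped waiting times of the dissertation; function and names are illustrative only.

```python
# Illustrative split of imbalance within partitions (intra) versus
# between partitions (inter), given per-rank load and a partition map.

from collections import defaultdict

def split_imbalance(loads, partition_of):
    groups = defaultdict(list)
    for rank, load in enumerate(loads):
        groups[partition_of[rank]].append(load)
    # Intra: excess over the partition mean, summed over all partitions.
    intra = sum(max(g) - sum(g) / len(g) for g in groups.values())
    # Inter: spread between the partition means themselves.
    means = [sum(g) / len(g) for g in groups.values()]
    inter = max(means) - sum(means) / len(means)
    return intra, inter

loads = [4.0, 4.0, 8.0, 8.0]                  # both partitions balanced,
part = {0: "A", 1: "A", 2: "B", 3: "B"}       # but B carries twice the load
intra, inter = split_imbalance(loads, part)
print(intra, inter)  # 0.0 2.0 -> purely inter-partition imbalance
```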
Slide 22: Example: ddcMD. The ddcMD molecular dynamics simulation on Blue Gene/P calculates particle and mesh forces in different partitions of fixed size; the mesh size is tuned to adjust the load balance. (Figure: task layout on the Blue Gene/P torus, from D. Richards et al.: Beyond Homogeneous Decomposition, SC'10.)
Slide 23: ddcMD: mesh-size tuning. A small mesh size increases the workload of the particle tasks; increasing the mesh size shifts the critical path to the mesh tasks. (Chart: runtime in seconds and resource consumption in CPU-hours versus mesh size, showing the critical path on particle tasks, the critical path on mesh tasks, total resource consumption, and inter-partition imbalance costs.)
Slide 24: Outline, resumed at Implementation.
Slide 25: Implementation. The methods are integrated in the Scalasca trace-analysis toolset, whose highly scalable parallel trace analysis had so far been used only for wait-state detection.
Slide 26: Parallel trace replay. The application records timestamped communication events, one trace file per process (e.g., rank i: ENTER, SEND, EXIT; rank j: ENTER, RECV, EXIT). The analysis processes traverse the traces in parallel and exchange information at the original synchronization points.
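The replay idea can be simulated in miniature. In reality Scalasca runs one analysis process per trace and re-enacts the original MPI communication; the single-process sketch below only mimics the data exchange at matching send/recv pairs (trace layout is an assumption for illustration).

```python
# Single-process simulation of forward trace replay: traverse two ranks'
# traces and "exchange" the sender's timestamp at every original
# send/recv pair so the receiver can classify its wait state.

trace_rank_i = [("ENTER", 0.0), ("SEND", 5.0), ("EXIT", 5.5)]  # delayed sender
trace_rank_j = [("ENTER", 0.0), ("RECV", 2.0), ("EXIT", 5.6)]

def replay(sender_trace, receiver_trace):
    waits = []
    send_times = [t for ev, t in sender_trace if ev == "SEND"]
    recv_times = [t for ev, t in receiver_trace if ev == "RECV"]
    for t_send, t_recv in zip(send_times, recv_times):
        if t_send > t_recv:              # late sender: receiver waited
            waits.append(t_send - t_recv)
    return waits

print(replay(trace_rank_i, trace_rank_j))  # [3.0]
```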
Slide 27: Trace-analysis extensions. Use multiple replay passes; a backward replay lets data travel from effect to cause, against the original message direction.
Slide 28: Delay detection via backward replay. 1. Identify the synchronization interval. 2. Determine the per-activity time vectors t_s (sender) and t_r (receiver). 3. Transfer the vector t_r to the sender. 4. Locate the delay. 5. Calculate the delay costs.
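The five steps above can be sketched as follows. This is a hedged simplification, with assumed data structures: t_s and t_r map activities to the time spent in the synchronization interval; after t_r travels to the sender, the delay is located as the excess of t_s over t_r and charged with the waiting time it produced.

```python
# Sketch of steps 4 and 5 on the sender side, after backward replay has
# delivered the receiver's time vector t_r (step 3).

def locate_delay(t_s, t_r, wait_time):
    """t_s, t_r: activity -> time spent in the synchronization interval."""
    excess = {a: t_s.get(a, 0.0) - t_r.get(a, 0.0)
              for a in set(t_s) | set(t_r)
              if t_s.get(a, 0.0) > t_r.get(a, 0.0)}
    total = sum(excess.values())
    # Step 5: apportion the waiting time over the delaying activities.
    return {a: wait_time * e / total for a, e in excess.items()}

t_s = {"work": 4.0, "extra_work": 3.0}   # sender's interval
t_r = {"work": 4.0}                      # receiver's interval
print(locate_delay(t_s, t_r, wait_time=3.0))  # {'extra_work': 3.0}
```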
Slide 29: Critical-path extraction in backward replay. A per-process flag c marks whether the critical path currently runs through that process. During the backward replay, the flag is handed off at each wait state (c=true moves to the process that caused the wait, c=false remains on the waiting process), so waiting time is never part of the critical path.
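The flag hand-off can be sketched with an illustrative event model (the tuple layout and activity names are assumptions, not the actual trace format): walking a merged event list backwards in time, the flag starts on the rank that finishes last and jumps to the delaying rank at every wait state.

```python
# Sketch of critical-path extraction: a boolean flag marks the rank
# currently on the critical path while events are visited in reverse
# time order; at a wait state the flag moves to the rank that caused it.

def extract_critical_path(events):
    """events: list of (rank, activity, waited_for) in reverse time order;
    waited_for is the rank that caused the wait, or None for computation."""
    on_path = events[0][0]            # start on the rank that finishes last
    path = []
    for rank, activity, waited_for in events:
        if rank != on_path:
            continue                  # flag (c=false) not on this rank
        if waited_for is not None:
            on_path = waited_for      # hand the flag to the delaying rank
        else:
            path.append((rank, activity))
    return list(reversed(path))       # report in forward time order

events = [(2, "finish", None), (2, "wait_recv", 1), (1, "long_work", None),
          (2, "setup", None), (1, "setup", None)]
print(extract_critical_path(events))
# [(1, 'setup'), (1, 'long_work'), (2, 'finish')]
```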
Slide 30: Scalability. Wall time (log scale, roughly 100 to 1000 s) versus process count for the uninstrumented program execution, wait-state analysis only, wait-state plus critical-path analysis, and combined wait-state, delay, and critical-path analysis (all via trace replay): scalability of root-cause and critical-path analysis for the Sweep3D benchmark on Blue Gene/P.
Slide 31: Summary. Two novel methods to locate and quantify imbalance: root-cause analysis identifies delays and explains the formation of wait states; critical-path analysis determines the impact of inefficiency on runtime. The implementation is highly scalable, handling O(100k) processes.
Slide 32: Thank you. Further reading: D. Böhme, B. R. de Supinski, M. Geimer, M. Schulz, and F. Wolf: Scalable Critical-Path Based Performance Analysis (IPDPS). D. Böhme, M. Geimer, and F. Wolf: Characterizing Load and Communication Imbalance in Large-Scale Parallel Applications (IPDPS PhD Forum). D. Böhme, M. Geimer, F. Wolf, and L. Arnold: Identifying the root causes of wait states in large-scale parallel applications (ICPP, best paper award). D. Böhme, M.-A. Hermanns, M. Geimer, and F. Wolf: Performance simulation of non-blocking communication in message-passing applications (PROPER).
Slide 33: SPEC MPI2007 case studies. (Charts: tracing runtime overhead, comparing uninstrumented execution, instrumented execution, and trace-replay times; and critical-path versus profile load-imbalance metrics. Benchmarks include RAxML, 122.tachyon, 121.pop2, 104.milc, 137.lu, 132.zeusmp2, 129.tera_tf, 128.GAPgeofem, 126.lammps, 147.l2wrf2, 145.lGemsFDTD, and 143.dleslie.)
More informationArchitecture of Parallel Computer Systems - Performance Benchmarking -
Architecture of Parallel Computer Systems - Performance Benchmarking - SoSe 18 L.079.05810 www.uni-paderborn.de/pc2 J. Simon - Architecture of Parallel Computer Systems SoSe 2018 < 1 > Definition of Benchmark
More informationIS TOPOLOGY IMPORTANT AGAIN? Effects of Contention on Message Latencies in Large Supercomputers
IS TOPOLOGY IMPORTANT AGAIN? Effects of Contention on Message Latencies in Large Supercomputers Abhinav S Bhatele and Laxmikant V Kale ACM Research Competition, SC 08 Outline Why should we consider topology
More informationExecution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures
Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture
More informationHAEC-SIM: A Simulation Framework for Highly Adaptive Energy-Efficient Computing Platforms
HAEC-SIM: A Simulation Framework for Highly Adaptive Energy-Efficient Computing Platforms SIMUTOOLS 2015 Athens, Greece Mario Bielert, Florina M. Ciorba, Kim Feldhoff, Thomas Ilsche, Wolfgang E. Nagel
More informationCSE5351: Parallel Processing Part III
CSE5351: Parallel Processing Part III -1- Performance Metrics and Benchmarks How should one characterize the performance of applications and systems? What are user s requirements in performance and cost?
More informationMethod-Level Phase Behavior in Java Workloads
Method-Level Phase Behavior in Java Workloads Andy Georges, Dries Buytaert, Lieven Eeckhout and Koen De Bosschere Ghent University Presented by Bruno Dufour dufour@cs.rutgers.edu Rutgers University DCS
More informationPerformance engineering of GemsFDTD computational electromagnetics solver
Performance engineering of GemsFDTD computational electromagnetics solver Ulf Andersson 1 and Brian J. N. Wylie 2 1 PDC Centre for High Performance Computing, KTH Royal Institute of Technology, Stockholm,
More informationAdvanced Software for the Supercomputer PRIMEHPC FX10. Copyright 2011 FUJITSU LIMITED
Advanced Software for the Supercomputer PRIMEHPC FX10 System Configuration of PRIMEHPC FX10 nodes Login Compilation Job submission 6D mesh/torus Interconnect Local file system (Temporary area occupied
More informationThe Potential of Diffusive Load Balancing at Large Scale
Center for Information Services and High Performance Computing The Potential of Diffusive Load Balancing at Large Scale EuroMPI 2016, Edinburgh, 27 September 2016 Matthias Lieber, Kerstin Gößner, Wolfgang
More informationHiTune. Dataflow-Based Performance Analysis for Big Data Cloud
HiTune Dataflow-Based Performance Analysis for Big Data Cloud Jinquan (Jason) Dai, Jie Huang, Shengsheng Huang, Bo Huang, Yan Liu Intel Asia-Pacific Research and Development Ltd Shanghai, China, 200241
More informationNetwork-on-chip (NOC) Topologies
Network-on-chip (NOC) Topologies 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and performance
More informationDesigning Parallel Programs. This review was developed from Introduction to Parallel Computing
Designing Parallel Programs This review was developed from Introduction to Parallel Computing Author: Blaise Barney, Lawrence Livermore National Laboratory references: https://computing.llnl.gov/tutorials/parallel_comp/#whatis
More informationPerformance of Multihop Communications Using Logical Topologies on Optical Torus Networks
Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,
More informationEE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 14 Parallelism in Software V
EE382 (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 14 Parallelism in Software V Mattan Erez The University of Texas at Austin EE382: Parallelilsm and Locality, Fall 2011 --
More informationAssignment 5. Georgia Koloniari
Assignment 5 Georgia Koloniari 2. "Peer-to-Peer Computing" 1. What is the definition of a p2p system given by the authors in sec 1? Compare it with at least one of the definitions surveyed in the last
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico February 29, 2016 CPD
More informationThe Flooding Time Synchronization Protocol
The Flooding Time Synchronization Protocol Miklos Maroti, Branislav Kusy, Gyula Simon and Akos Ledeczi Vanderbilt University Contributions Better understanding of the uncertainties of radio message delivery
More informationEliminate Threading Errors to Improve Program Stability
Eliminate Threading Errors to Improve Program Stability This guide will illustrate how the thread checking capabilities in Parallel Studio can be used to find crucial threading defects early in the development
More informationUsing Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology
Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore
More informationSlack: A New Performance Metric for Parallel Programs
Slack: A New Performance Metric for Parallel Programs Jeffrey K. Hollingsworth Barton P. Miller hollings@cs.umd.edu bart@cs.wisc.edu Computer Science Department University of Maryland College Park, MD
More informationbtrecord and btreplay User Guide
btrecord and btreplay User Guide Alan D. Brunelle (Alan.Brunelle@hp.com) December 10, 2011 Abstract The btrecord and btreplay tools provide the ability to record and replay IOs captured by the blktrace
More informationPeta-Scale Simulations with the HPC Software Framework walberla:
Peta-Scale Simulations with the HPC Software Framework walberla: Massively Parallel AMR for the Lattice Boltzmann Method SIAM PP 2016, Paris April 15, 2016 Florian Schornbaum, Christian Godenschwager,
More informationFast Dynamic Load Balancing for Extreme Scale Systems
Fast Dynamic Load Balancing for Extreme Scale Systems Cameron W. Smith, Gerrett Diamond, M.S. Shephard Computation Research Center (SCOREC) Rensselaer Polytechnic Institute Outline: n Some comments on
More informationObject Placement in Shared Nothing Architecture Zhen He, Jeffrey Xu Yu and Stephen Blackburn Λ
45 Object Placement in Shared Nothing Architecture Zhen He, Jeffrey Xu Yu and Stephen Blackburn Λ Department of Computer Science The Australian National University Canberra, ACT 2611 Email: fzhen.he, Jeffrey.X.Yu,
More informationCray XC Scalability and the Aries Network Tony Ford
Cray XC Scalability and the Aries Network Tony Ford June 29, 2017 Exascale Scalability Which scalability metrics are important for Exascale? Performance (obviously!) What are the contributing factors?
More informationSome aspects of parallel program design. R. Bader (LRZ) G. Hager (RRZE)
Some aspects of parallel program design R. Bader (LRZ) G. Hager (RRZE) Finding exploitable concurrency Problem analysis 1. Decompose into subproblems perhaps even hierarchy of subproblems that can simultaneously
More informationExperiences with Achieving Portability across Heterogeneous Architectures
Experiences with Achieving Portability across Heterogeneous Architectures Lukasz G. Szafaryn +, Todd Gamblin ++, Bronis R. de Supinski ++ and Kevin Skadron + + University of Virginia ++ Lawrence Livermore
More informationA CASE STUDY OF COMMUNICATION OPTIMIZATIONS ON 3D MESH INTERCONNECTS
A CASE STUDY OF COMMUNICATION OPTIMIZATIONS ON 3D MESH INTERCONNECTS Abhinav Bhatele, Eric Bohm, Laxmikant V. Kale Parallel Programming Laboratory Euro-Par 2009 University of Illinois at Urbana-Champaign
More informationPerformance without Pain = Productivity Data Layout and Collective Communication in UPC
Performance without Pain = Productivity Data Layout and Collective Communication in UPC By Rajesh Nishtala (UC Berkeley), George Almási (IBM Watson Research Center), Călin Caşcaval (IBM Watson Research
More information"Charting the Course to Your Success!" MOC A Developing High-performance Applications using Microsoft Windows HPC Server 2008
Description Course Summary This course provides students with the knowledge and skills to develop high-performance computing (HPC) applications for Microsoft. Students learn about the product Microsoft,
More informationTrade- Offs in Cloud Storage Architecture. Stefan Tai
Trade- Offs in Cloud Storage Architecture Stefan Tai Cloud computing is about providing and consuming resources as services There are five essential characteristics of cloud services [NIST] [NIST]: http://csrc.nist.gov/groups/sns/cloud-
More informationI/O at JSC. I/O Infrastructure Workloads, Use Case I/O System Usage and Performance SIONlib: Task-Local I/O. Wolfgang Frings
Mitglied der Helmholtz-Gemeinschaft I/O at JSC I/O Infrastructure Workloads, Use Case I/O System Usage and Performance SIONlib: Task-Local I/O Wolfgang Frings W.Frings@fz-juelich.de Jülich Supercomputing
More informationMessage Passing Interface (MPI) on Intel Xeon Phi coprocessor
Message Passing Interface (MPI) on Intel Xeon Phi coprocessor Special considerations for MPI on Intel Xeon Phi and using the Intel Trace Analyzer and Collector Gergana Slavova gergana.s.slavova@intel.com
More informationDESIGN AND OVERHEAD ANALYSIS OF WORKFLOWS IN GRID
I J D M S C L Volume 6, o. 1, January-June 2015 DESIG AD OVERHEAD AALYSIS OF WORKFLOWS I GRID S. JAMUA 1, K. REKHA 2, AD R. KAHAVEL 3 ABSRAC Grid workflow execution is approached as a pure best effort
More informationsimulation framework for piecewise regular grids
WALBERLA, an ultra-scalable multiphysics simulation framework for piecewise regular grids ParCo 2015, Edinburgh September 3rd, 2015 Christian Godenschwager, Florian Schornbaum, Martin Bauer, Harald Köstler
More informationLecture 4: Principles of Parallel Algorithm Design (part 4)
Lecture 4: Principles of Parallel Algorithm Design (part 4) 1 Mapping Technique for Load Balancing Minimize execution time Reduce overheads of execution Sources of overheads: Inter-process interaction
More informationIntel Xeon Phi архитектура, модели программирования, оптимизация.
Нижний Новгород, 2016 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture How Programming
More informationOpenACC Support in Score-P and Vampir
Center for Information Services and High Performance Computing (ZIH) OpenACC Support in Score-P and Vampir Hands-On for the Taurus GPU Cluster February 2016 Robert Dietrich (robert.dietrich@tu-dresden.de)
More informationAuto Source Code Generation and Run-Time Infrastructure and Environment for High Performance, Distributed Computing Systems
Auto Source Code Generation and Run-Time Infrastructure and Environment for High Performance, Distributed Computing Systems Minesh I. Patel Ph.D. 1, Karl Jordan 1, Mattew Clark Ph.D. 1, and Devesh Bhatt
More informationIntroduction to Performance Engineering
Introduction to Performance Engineering Markus Geimer Jülich Supercomputing Centre (with content used with permission from tutorials by Bernd Mohr/JSC and Luiz DeRose/Cray) Performance: an old problem
More informationPosition Paper: OpenMP scheduling on ARM big.little architecture
Position Paper: OpenMP scheduling on ARM big.little architecture Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, and Michel Robert LIRMM
More informationPerformance Cloning: A Technique for Disseminating Proprietary Applications as Benchmarks
Performance Cloning: Technique for isseminating Proprietary pplications as enchmarks jay Joshi (University of Texas) Lieven Eeckhout (Ghent University, elgium) Robert H. ell Jr. (IM Corp.) Lizy John (University
More informationAutomatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC
Automatic Compiler-Based Optimization of Graph Analytics for the GPU Sreepathi Pai The University of Texas at Austin May 8, 2017 NVIDIA GTC Parallel Graph Processing is not easy 299ms HD-BFS 84ms USA Road
More informationPerformance analysis basics
Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis
More informationMark Sandstrom ThroughPuter, Inc.
Hardware Implemented Scheduler, Placer, Inter-Task Communications and IO System Functions for Many Processors Dynamically Shared among Multiple Applications Mark Sandstrom ThroughPuter, Inc mark@throughputercom
More informationMulti-Domain Pattern. I. Problem. II. Driving Forces. III. Solution
Multi-Domain Pattern I. Problem The problem represents computations characterized by an underlying system of mathematical equations, often simulating behaviors of physical objects through discrete time
More informationADAPTIVE TASK SCHEDULING USING LOW-LEVEL RUNTIME APIs AND MACHINE LEARNING
ADAPTIVE TASK SCHEDULING USING LOW-LEVEL RUNTIME APIs AND MACHINE LEARNING Keynote, ADVCOMP 2017 November, 2017, Barcelona, Spain Prepared by: Ahmad Qawasmeh Assistant Professor The Hashemite University,
More informationThe Design and Implementation of a Next Generation Name Service for the Internet (CoDoNS) Presented By: Kamalakar Kambhatla
The Design and Implementation of a Next Generation Name Service for the Internet (CoDoNS) Venugopalan Ramasubramanian Emin Gün Sirer Presented By: Kamalakar Kambhatla * Slides adapted from the paper -
More informationIntroducing Scalileo A Java Based Scaling Framework
Introducing Scalileo A Java Based Scaling Framework Tilmann Rabl, Christian Dellwo, Harald Kosch Chair of Distributed Database Systems First International Conference on Energy-Efficient Computing and Networking
More information