Performance properties The metrics tour

1 Performance properties: The metrics tour
Markus Geimer & Brian Wylie, Jülich Supercomputing Centre, August 2012

2 Scalasca analysis result

3 Online description
Analysis report explorer GUI provides hyperlinked descriptions of performance properties
Diagnosis hints suggest how to refine diagnosis of performance problems and possible remediation

4 Confused?

5 Generic metrics

6 Generic metrics
Time: Total CPU allocation time
  Execution: Execution time without overhead
  Overhead: Time spent in tasks related to measurement (does not include per-function perturbation!)
Visits: Number of times a function/region was executed
Comp. imbalance: Simple load imbalance heuristics
Hardware counters: Aggregated counter values for each function/region

7 Computational imbalance hierarchy
Comp. imbalance
  Overload
    Single participant
  Underload
    Non-participation
    Singularity

8 Computational imbalance
Absolute difference to average exclusive execution time
Focusses only on computational parts
Captures global imbalances
  Based on entire measurement
  Does not compare individual instances of function calls
High value = imbalance in the sub-calltree underneath
  Expand the subtree to find the real location of the imbalance
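
As a rough illustration of the heuristic above, the following Python sketch computes the absolute difference to the average exclusive execution time for one call-path. The function name `computational_imbalance` and the sample timings are hypothetical; this is not Scalasca's own implementation.

```python
from statistics import mean

def computational_imbalance(exclusive_times):
    """Per-location absolute difference to the average exclusive time."""
    avg = mean(exclusive_times)
    return [abs(t - avg) for t in exclusive_times]

# Hypothetical exclusive execution times (seconds) of one call-path
# on four processes, aggregated over the entire measurement.
times = [2.0, 2.0, 2.0, 6.0]
imbalance = computational_imbalance(times)
print(imbalance)
```

Because the values are aggregated over the whole run, a sketch like this cannot tell which individual call instance caused the difference.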

9 Overload
Identifies processes/threads where exclusive execution time for the call-path was above average

10 Overload, Single participant
Identifies call-paths executed by a single process/thread

11 Underload
Identifies processes/threads where exclusive execution time for the call-path was below average

12 Underload, Non-participation
Identifies call-paths not executed by a subset of processes/threads

13 Underload, Singularity
Identifies call-paths not executed by any process/thread except a single one
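
The overload/underload categories can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: an exclusive time of 0.0 stands in for "call-path not executed" (a real analysis would look at visit counts), and the function name `classify` and the label strings are hypothetical.

```python
def classify(exclusive_times):
    """Label each location and report a special participation case, if any."""
    n = len(exclusive_times)
    avg = sum(exclusive_times) / n
    labels = []
    for t in exclusive_times:
        if t > avg:
            labels.append("overload")
        elif t < avg:
            labels.append("underload")
        else:
            labels.append("balanced")
    # Assumption: zero exclusive time means the call-path was not executed.
    participants = sum(1 for t in exclusive_times if t > 0.0)
    if n > 1 and participants == 1:
        special = "single participant / singularity"
    elif 0 < participants < n:
        special = "non-participation"
    else:
        special = None
    return labels, special

one_worker = classify([0.0, 0.0, 0.0, 4.0])
some_idle = classify([0.0, 3.0, 3.0, 6.0])
print(one_worker)
print(some_idle)
```

In the first sample the single executing process carries the overload, while the three others show the matching underload.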

14 MPI-related metrics

15 MPI Time hierarchy
MPI
  Synchronization
    Collective
    RMA
      Active Target
      Passive Target
  Communication
    Point-to-point
    Collective
    RMA
  File I/O
    Collective
  Init/Exit

16 MPI Time hierarchy details
MPI: Time spent in pre-instrumented MPI functions
Synchronization: Time spent in calls to MPI_Barrier or Remote Memory Access synchronization calls
Communication: Time spent in MPI communication calls, subdivided into collective, point-to-point and RMA
File I/O: Time spent in MPI file I/O functions, with specialization for collective I/O calls
Init/Exit: Time spent in MPI_Init and MPI_Finalize

17 MPI Synchronizations hierarchy
Synchronizations
  Point-to-point
    Sends
    Receives
  Collective
  RMA
    Fences
    GATS Epochs
      Access Epochs
      Exposure Epochs
    Locks
Provides the number of calls to an MPI synchronization function of the corresponding class
Synchronizations include zero-sized message transfers!

18 MPI Communications hierarchy
Communications
  Point-to-point
    Sends
    Receives
  Collective
    Exchange
    As source
    As destination
  RMA
    Puts
    Gets
Provides the number of calls to an MPI communication function of the corresponding class
Zero-sized message transfers are considered synchronization!

19 MPI Transfer hierarchy
Bytes transferred
  Point-to-point
    Sent
    Received
  Collective
    Outgoing
    Incoming
  RMA
    Sent
    Received
Provides the number of bytes transferred by an MPI communication function of the corresponding class
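
The split between the synchronization, communication and transfer counts can be sketched as follows. The sketch only encodes the rule stated on the slides, namely that zero-sized message transfers count as synchronizations; the function name `count_sends` and the dictionary keys are hypothetical.

```python
def count_sends(message_sizes):
    """Classify point-to-point sends by payload size (in bytes)."""
    synchronizations = sum(1 for size in message_sizes if size == 0)
    communications = len(message_sizes) - synchronizations
    return {
        "synchronizations": synchronizations,  # zero-sized transfers
        "communications": communications,      # transfers carrying data
        "bytes_sent": sum(message_sizes),
    }

# Hypothetical payload sizes of four sends issued by one process.
counts = count_sends([0, 1024, 0, 8])
print(counts)
```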

20 MPI File operations hierarchy
MPI file operations
  Individual
    Reads
    Writes
  Collective
    Reads
    Writes
Provides the number of calls to MPI file I/O functions of the corresponding class

21 MPI File bytes transferred hierarchy
MPI file bytes transferred
  Individual
    Read
    Written
  Collective
    Read
    Written
Provides the number of bytes for MPI file I/O functions of the corresponding class

22 MPI collective synchronization time
MPI
  Synchronization
    Collective
      Wait at Barrier
      Barrier Completion
    RMA
  Communication
  File I/O
  Init/Exit

23 Wait at Barrier
[Timeline diagram, location vs. time: four processes calling MPI_Barrier]
Time spent waiting in front of a barrier call until the last process reaches the barrier operation
Applies to: MPI_Barrier

24 Barrier Completion
[Timeline diagram, location vs. time: four processes calling MPI_Barrier]
Time spent in barrier after the first process has left the operation
Applies to: MPI_Barrier
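
Both barrier metrics follow directly from the enter and leave timestamps of each process. The Python sketch below is a simplified illustration that assumes perfectly synchronized clocks; the function name `barrier_waits` and the timestamps are hypothetical.

```python
def barrier_waits(enter, leave):
    """Return (wait_at_barrier, barrier_completion) per process."""
    last_enter = max(enter)    # moment the last process reaches the barrier
    first_leave = min(leave)   # moment the first process leaves the barrier
    wait_at_barrier = [last_enter - t for t in enter]
    barrier_completion = [t - first_leave for t in leave]
    return wait_at_barrier, barrier_completion

# Hypothetical timestamps (seconds) for four processes in one MPI_Barrier.
enter = [1.0, 2.0, 4.0, 3.0]
leave = [4.5, 4.25, 4.25, 5.0]
wait, completion = barrier_waits(enter, leave)
print(wait, completion)
```

The process that enters last has no waiting time, so a large Wait at Barrier value on the others points at whatever delayed that process before the barrier.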

25 MPI collective communication time
MPI
  Synchronization
  Communication
    Point-to-point
    Collective
      Early Reduce
      Early Scan
      Late Broadcast
      Wait at N x N
      N x N Completion
    RMA

26 Wait at N x N
[Timeline diagram, location vs. time: four processes calling MPI_Allreduce]
Time spent waiting in front of a synchronizing collective operation call until the last process reaches the operation
Applies to: MPI_Allgather, MPI_Allgatherv, MPI_Allreduce, MPI_Alltoall, MPI_Reduce_scatter, MPI_Reduce_scatter_block

27 N x N Completion
[Timeline diagram, location vs. time: four processes calling MPI_Allreduce]
Time spent in synchronizing collective operations after the first process has left the operation
Applies to: MPI_Allgather, MPI_Allgatherv, MPI_Allreduce, MPI_Alltoall, MPI_Reduce_scatter, MPI_Reduce_scatter_block

28 Late Broadcast
[Timeline diagram, location vs. time: four processes calling MPI_Bcast, one of them the root]
Waiting times if the destination processes of a collective 1-to-N communication operation enter the operation earlier than the source process (root)
Applies to: MPI_Bcast, MPI_Scatter, MPI_Scatterv
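
A minimal Python sketch of the Late Broadcast idea: every destination that enters before the root waits until the root arrives. The function name `late_broadcast_waits` and the timestamps are hypothetical, and synchronized clocks are assumed.

```python
def late_broadcast_waits(enter, root):
    """Waiting time per rank in a 1-to-N operation with the given root."""
    root_enter = enter[root]
    waits = []
    for rank, t in enumerate(enter):
        if rank == root:
            waits.append(0.0)  # the root never waits for itself
        else:
            waits.append(max(0.0, root_enter - t))
    return waits

# Hypothetical enter timestamps (seconds); rank 1 is the root.
bcast_waits = late_broadcast_waits([1.0, 3.0, 2.0, 4.0], root=1)
print(bcast_waits)
```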

29 Early Reduce
[Timeline diagram, location vs. time: four processes calling MPI_Reduce, one of them the root]
Waiting time if the destination process (root) of a collective N-to-1 communication operation enters the operation earlier than its sending counterparts
Applies to: MPI_Reduce, MPI_Gather, MPI_Gatherv

30 Early Scan
[Timeline diagram, location vs. time: ranks 0 to 3 calling MPI_Scan]
Waiting time if process n enters a prefix reduction operation earlier than its sending counterparts (i.e., ranks 0..n-1)
Applies to: MPI_Scan, MPI_Exscan
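
The Early Scan situation can be sketched in Python as below. The sketch assumes that rank n cannot complete before the latest of ranks 0..n-1 has entered the operation; whether the analyzer uses exactly this reference point is an assumption here, and the function name `early_scan_waits` is hypothetical.

```python
def early_scan_waits(enter):
    """Waiting time per rank in a prefix reduction, ranks in ascending order."""
    waits = []
    latest_predecessor = None  # latest enter time among ranks 0..n-1
    for t in enter:
        if latest_predecessor is None:
            waits.append(0.0)  # rank 0 has no sending counterparts
            latest_predecessor = t
        else:
            waits.append(max(0.0, latest_predecessor - t))
            latest_predecessor = max(latest_predecessor, t)
    return waits

# Hypothetical enter timestamps (seconds) for ranks 0 to 3.
scan_waits = early_scan_waits([3.0, 1.0, 4.0, 2.0])
print(scan_waits)
```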

31 MPI point-to-point communication time
MPI
  Synchronization
  Communication
    Point-to-point
      Late Sender
        Msg. in Wrong Order
          Same Source
          Different Source
      Late Receiver
    Collective
    RMA

32 Late Sender
[Two timeline diagrams, location vs. time: MPI_Send matched by MPI_Recv and by MPI_Irecv/MPI_Wait; MPI_Isend/MPI_Wait matched by MPI_Recv and by MPI_Irecv/MPI_Wait]
Waiting time caused by a blocking receive operation posted earlier than the corresponding send operation
Applies to blocking as well as non-blocking communication

33 Late Sender (II)
[Timeline diagram, location vs. time: two MPI_Send calls matched by MPI_Irecv and MPI_Waitall]
While waiting for several messages, the maximum waiting time is accounted
Applies to: MPI_Waitall, MPI_Waitsome
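
The two Late Sender slides can be condensed into a short Python sketch: a receive posted before its send waits until the send starts, and a wait on several messages is charged with the maximum of the individual waiting times. The function names and timestamps are hypothetical, clocks are assumed to be synchronized, and a real analysis would additionally bound the result by the time actually spent in the receive call.

```python
def late_sender_wait(recv_enter, send_enter):
    """Waiting time of one receive that was posted before its send."""
    return max(0.0, send_enter - recv_enter)

def late_sender_wait_multi(wait_enter, send_enters):
    """Waiting time of a call that waits for several messages at once."""
    return max((late_sender_wait(wait_enter, s) for s in send_enters),
               default=0.0)

single = late_sender_wait(recv_enter=1.0, send_enter=3.5)
on_time = late_sender_wait(recv_enter=4.0, send_enter=3.5)
# Hypothetical MPI_Waitall entered at t=2.0 for three pending messages.
multi = late_sender_wait_multi(2.0, [1.0, 3.0, 5.0])
print(single, on_time, multi)
```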

34 Late Sender, Messages in Wrong Order
[Timeline diagram, location vs. time: four MPI_Send calls and four MPI_Recv calls]
Refers to Late Sender situations which are caused by messages received in wrong order
Comes in two flavours:
  Messages sent from same source location
  Messages sent from different source locations

35 Late Receiver
[Timeline diagram, location vs. time: MPI_Send matched by MPI_Recv, and MPI_Send matched by MPI_Irecv/MPI_Wait]
Waiting time caused by a blocking send operation posted earlier than the corresponding receive operation
Calculated by receiver but waiting time attributed to sender
Currently does not apply to non-blocking sends
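
The attribution rule for Late Receiver can be sketched as follows: the waiting time is derived from the receiver's timestamp but booked on the sender. The sketch assumes that each blocking send cannot complete before the matching receive is posted (for example a synchronous or rendezvous transfer); the function name `late_receiver_waits` and the message tuples are hypothetical.

```python
def late_receiver_waits(messages):
    """Sum Late Receiver waiting time per sender rank.

    messages: iterable of (sender_rank, send_enter, recv_enter) tuples
    describing blocking sends and their matching receives.
    """
    waits = {}
    for sender, send_enter, recv_enter in messages:
        wait = max(0.0, recv_enter - send_enter)
        # Attributed to the sender, although recv_enter comes from the receiver.
        waits[sender] = waits.get(sender, 0.0) + wait
    return waits

# Hypothetical messages: two from rank 0, one from rank 1.
waits_by_sender = late_receiver_waits([(0, 1.0, 2.5), (0, 4.0, 3.0), (1, 2.0, 6.0)])
print(waits_by_sender)
```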

36 Late Sender/Receiver Counts
The number of Late Sender / Late Receiver instances is also available
The counts are divided into communications and synchronizations and shown in the corresponding hierarchies

37 MPI RMA synchronization time
MPI
  Synchronization
    Collective
    RMA
      Active Target
        Late Post
        Wait at Fence
          Early Fence
        Early Wait
          Late Complete
      Passive Target

38 Late Post
[Timeline diagram, location vs. time: two origins each calling Win_start, Put, Win_complete; target calling Win_post, Win_wait]
MPI_Win_start (top) or MPI_Win_complete (bottom) wait until exposure epoch is opened by MPI_Win_post
Which of the two calls blocks is implementation dependent

39 Wait at Fence
[Timeline diagram, location vs. time: Put followed by Win_fence on one location, Win_fence on another]
Time spent waiting in front of a synchronizing MPI_Win_fence call until the last process reaches the fence operation
Only triggered if at least one of the following conditions applies:
  Given assertion is 0
  All fence calls overlap (heuristic)

40 Early Fence
[Timeline diagram, location vs. time: Put followed by Win_fence on one location, Win_fence on another]
Time spent waiting for exit of last RMA operation to target location
Sub-pattern of Wait at Fence

41 Early Wait
[Timeline diagram, location vs. time: two origins each calling Win_start, Put, Win_complete; target calling Win_post, Win_wait]
Time spent in MPI_Win_wait until access epoch is closed by last MPI_Win_complete

42 Late Complete
[Timeline diagram, location vs. time: two origins each calling Win_start, Put, Win_complete; target calling Win_post, Win_wait]
Waiting time due to unnecessary pause between last RMA operation to target and closing the access epoch by last MPI_Win_complete
Sub-pattern of Early Wait

43 MPI RMA communication time
MPI
  Synchronization
  Communication
    Point-to-point
    Collective
    RMA
      Early Transfer

44 Early Transfer
[Timeline diagram, location vs. time: origin calling Win_start, Put, Win_complete; target calling Win_post, Win_wait]
Time spent waiting in RMA operation on origin(s) started before exposure epoch was opened on target

45 OpenMP-related metrics

46 OpenMP Time hierarchy
Time
  Execution
    MPI
    OMP
      Flush
      Management
        Fork
      Synchronization
  Overhead
  Idle Threads
    Limited parallelism

47 OpenMP Time hierarchy details
OMP: Time spent for OpenMP-related tasks
Flush: Time spent in OpenMP flush directives
Synchronization: Time spent to synchronize OpenMP threads

48 OpenMP Management Time
[Timeline diagram, location vs. time: serial phases alternating with parallel region bodies]
Time spent on master thread for creating/destroying OpenMP thread teams

49 OpenMP Fork Time
[Timeline diagram, location vs. time: serial phases alternating with parallel region bodies]
Time spent on master thread for creating OpenMP thread teams

50 OpenMP Idle Threads
[Timeline diagram, location vs. time: serial phases alternating with parallel region bodies]
Time spent idle on CPUs reserved for worker threads

51 OpenMP Limited Parallelism
[Timeline diagram, location vs. time: serial phases alternating with parallel region bodies]
Time spent idle on worker threads within parallel regions
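
The relation between Idle Threads and Limited Parallelism can be illustrated with a small Python sketch. It assumes that a program run is described as a list of phases, each with a number of busy threads, and that idle time inside parallel regions is counted as part of Idle Threads as well; the phase tuples and the function name `idle_thread_times` are hypothetical.

```python
def idle_thread_times(phases, team_size):
    """Return (idle_threads, limited_parallelism) in thread-seconds.

    phases: list of (kind, duration, active_threads), where kind is
    "serial" or "parallel".
    """
    idle = 0.0
    limited = 0.0
    for kind, duration, active in phases:
        unused = (team_size - active) * duration
        idle += unused          # CPUs reserved for workers but not busy
        if kind == "parallel":  # idle workers inside a parallel region
            limited += unused
    return idle, limited

# Hypothetical run with a team of four threads.
phases = [("serial", 2.0, 1), ("parallel", 4.0, 4),
          ("serial", 1.0, 1), ("parallel", 2.0, 2)]
idle, limited = idle_thread_times(phases, team_size=4)
print(idle, limited)
```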

52 OpenMP Synchronization Time hierarchy
OMP
  Flush
  Management
  Synchronization
    Barrier
      Explicit
      Implicit
    Critical
    Lock API
    Ordered
Time spent in OpenMP atomic constructs is attributed to the Critical metric

53 OpenMP barrier synchronization time
OpenMP
  Synchronization
    Barrier
      Explicit
        Wait at Barrier
      Implicit
        Wait at Barrier

54 Wait at Barrier
[Timeline diagram, location vs. time: four threads reaching an OpenMP barrier]
Time spent waiting in front of a barrier call until the last thread reaches the barrier operation
Applies to: Implicit/explicit barriers

55 OpenMP-related metrics (as produced by Scalasca's sequential trace analyzer for OpenMP and hybrid MPI/OpenMP applications)

56 OpenMP Time hierarchy
Time
  Execution
    MPI
    OpenMP
      Synchronization
      Fork
      Flush
  Idle Threads
  Overhead

57 OpenMP Time hierarchy details
OpenMP: Time spent for OpenMP-related tasks
Synchronization: Time spent for synchronizing OpenMP threads
Fork: Time spent by master thread to create thread teams
Flush: Time spent in OpenMP flush directives
Idle Threads: Time spent idle on CPUs reserved for worker threads

58 OpenMP synchronization time
OpenMP
  Synchronization
    Barrier
      Explicit
        Wait at Barrier
      Implicit
        Wait at Barrier
    Lock Competition
      API
      Critical

59 Lock Competition
[Timeline diagram, location vs. time: two Acquire Lock / Release Lock pairs]
Time spent waiting for a lock that has been previously acquired by another thread
Applies to: critical sections, OpenMP lock API
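
Lock Competition can be sketched from a trace of lock events: a thread that tries to acquire a lock before the previous owner has released it waits for the difference. The event tuples and the function name `lock_competition_waits` are hypothetical, and the sketch assumes that events for one lock are listed in the order in which the lock was granted.

```python
def lock_competition_waits(events):
    """Sum lock waiting time per thread.

    events: list of (thread, acquire_enter, release) tuples for a single
    lock, ordered by the time the lock was granted.
    """
    waits = {}
    previous_release = None
    for thread, acquire_enter, release in events:
        if previous_release is None:
            wait = 0.0  # first owner finds the lock free
        else:
            wait = max(0.0, previous_release - acquire_enter)
        waits[thread] = waits.get(thread, 0.0) + wait
        previous_release = release
    return waits

# Hypothetical trace: thread 1 requests the lock while thread 0 holds it.
lock_waits = lock_competition_waits([(0, 1.0, 3.0), (1, 2.0, 5.0), (2, 6.0, 7.0)])
print(lock_waits)
```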

60 Happy end...

Contact: scalasca@fz-juelich.de