Scalasca performance properties The metrics tour



1 Scalasca performance properties The metrics tour Markus Geimer

2 Scalasca analysis result

3 Generic metrics

4 Generic metrics
  Time: Total CPU allocation time
  Execution: Execution time without overhead
  Overhead: Time spent in tasks related to measurement (does not include per-function perturbation!)
  Visits: Number of times a function/region was executed
  Hardware counters: Aggregated counter values for each function/region
  Comp. Imbalance: Simple load imbalance heuristic

5 Computational imbalance
  Simple load imbalance heuristic
    Focuses only on computational parts
    Easy to calculate
  Absolute difference to average exclusive execution time
  Captures global imbalances
    Based on entire measurement
    Does not compare individual instances of function calls
  High value = imbalance in the sub-calltree underneath
    Expand the subtree to find the real location of the imbalance
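As a minimal sketch of the heuristic on this slide, the following Python snippet computes, for each process, the absolute difference between its exclusive execution time and the average over all processes. The function name and the sample times are hypothetical, and the slides do not give the exact formula Scalasca uses, so this is only a literal reading of "absolute difference to average exclusive execution time".

```python
def computational_imbalance(exclusive_times):
    """Per-process absolute difference to the average exclusive time."""
    avg = sum(exclusive_times) / len(exclusive_times)
    return [abs(t - avg) for t in exclusive_times]


# Hypothetical exclusive execution times (seconds) of one region on 4 processes.
times = [2.0, 2.0, 2.0, 6.0]
imbalance = computational_imbalance(times)
print(imbalance)
```

In this made-up example the fourth process stands out, which is the kind of hint that suggests expanding the subtree underneath.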

6 MPI-related metrics

7 MPI Time hierarchy
  Time
    Execution
      MPI
        Synchronization
          Collective
        Communication
          Point-to-point
          Collective
        File I/O
          Collective
        Init/Exit
    Overhead

8 MPI Time hierarchy details
  MPI: Time spent in pre-instrumented MPI functions
  Synchronization: Time spent in calls to MPI_Barrier
  Communication: Time spent in MPI communication calls, subdivided into collective and point-to-point
  File I/O: Time spent in MPI file I/O functions, with specialization for collective I/O calls
  Init/Exit: Time spent in MPI_Init and MPI_Finalize

9 MPI Communications hierarchy
  Communications
    Point-to-point
      Sends
      Receives
    Collective
      Exchange
      As Source
      As Destination
  Provides the number of calls to an MPI communication function of the corresponding class
  Zero-sized message transfers are considered synchronization!

10 MPI Synchronizations hierarchy
  Synchronizations
    Point-to-point
      Sends
      Receives
    Collective
  Provides the number of calls to an MPI synchronization function of the corresponding class
  Synchronizations include zero-sized message transfers!
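The rule that zero-sized transfers count as synchronization rather than communication can be sketched as follows. The labels "p2p" and "collective", the function name, and the sample calls are hypothetical; the snippet only illustrates how call counts would be split between the two hierarchies, not how Scalasca implements it.

```python
from collections import Counter


def classify_call(kind, nbytes):
    """Return (hierarchy, class) for one MPI call.

    kind:   "p2p" or "collective" (hypothetical labels)
    nbytes: number of bytes transferred by the call
    """
    hierarchy = "Synchronizations" if nbytes == 0 else "Communications"
    return hierarchy, kind


# Hypothetical list of (kind, bytes transferred) for recorded calls.
calls = [("p2p", 1024), ("p2p", 0), ("collective", 0), ("collective", 64)]
counts = Counter(classify_call(kind, n) for kind, n in calls)
print(counts)
```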

11 MPI Transfer hierarchy
  Bytes transferred
    Point-to-point
      Sent
      Received
    Collective
      Outgoing
      Incoming
  Provides the number of bytes transferred by an MPI communication function of the corresponding class

12 MPI File operations hierarchy
  MPI file operations
    Individual
      Reads
      Writes
    Collective
      Reads
      Writes
  Provides the number of calls to MPI file I/O functions of the corresponding class

13 MPI collective synchronization time
  MPI
    Synchronization
      Collective
        Wait at Barrier
        Barrier Completion
    Communication
    File I/O
    Init/Exit

14 Wait at Barrier
  [Figure: time-line diagram (location vs. time) showing four MPI_Barrier calls]
  Time spent waiting in front of a barrier call until the last process reaches the barrier operation
  Applies to: MPI_Barrier
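A minimal sketch of this definition: each process waits from the moment it enters the barrier until the last process arrives. The function name and the enter timestamps are hypothetical, and details such as clock correction between processes are ignored.

```python
def wait_at_barrier(enter_times):
    """Per-process waiting time until the last process enters the barrier."""
    last_enter = max(enter_times)
    return [last_enter - t for t in enter_times]


# Hypothetical MPI_Barrier enter timestamps (seconds) for ranks 0..3.
enter = [1.0, 3.0, 2.0, 4.0]
waits = wait_at_barrier(enter)
print(waits)
```

The last process to arrive has no waiting time of this kind; the earliest one accumulates the most.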

15 Barrier Completion
  [Figure: time-line diagram (location vs. time) showing four MPI_Barrier calls]
  Time spent in barrier after the first process has left the operation
  Applies to: MPI_Barrier
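Read literally, this metric is the time each process still spends inside the barrier after the first process has already left it. The sketch below assumes all processes are inside the barrier at that moment; the function name and the exit timestamps are hypothetical.

```python
def barrier_completion(exit_times):
    """Per-process time spent in the barrier after the first process left."""
    first_exit = min(exit_times)
    return [t - first_exit for t in exit_times]


# Hypothetical MPI_Barrier exit timestamps (seconds) for ranks 0..3.
exits = [5.0, 5.5, 5.0, 6.0]
completion = barrier_completion(exits)
print(completion)
```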

16 MPI collective communication time
  MPI
    Synchronization
    Communication
      Point-to-point
      Collective
        Early Reduce
        Early Scan
        Late Broadcast
        Wait at N x N
        N x N Completion

17 Wait at N x N
  [Figure: time-line diagram (location vs. time) showing four MPI_Allreduce calls]
  Time spent waiting in front of a synchronizing collective operation call until the last process reaches the operation
  Applies to: MPI_Allreduce, MPI_Alltoall, MPI_Alltoallv, MPI_Allgather, MPI_Allgatherv, MPI_Reduce_scatter

18 N x N Completion
  [Figure: time-line diagram (location vs. time) showing four MPI_Allreduce calls]
  Time spent in synchronizing collective operations after the first process has left the operation
  Applies to: MPI_Allreduce, MPI_Alltoall, MPI_Alltoallv, MPI_Allgather, MPI_Allgatherv, MPI_Reduce_scatter

19 Late Broadcast
  [Figure: time-line diagram (location vs. time) showing four MPI_Bcast calls, one of them the root]
  Waiting times if the destination processes of a collective 1-to-N communication operation enter the operation earlier than the source process (root)
  Applies to: MPI_Bcast, MPI_Scatter, MPI_Scatterv
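The following sketch illustrates the definition: a destination process that enters the operation before the root waits until the root enters, while processes arriving after the root (and the root itself) have no such waiting time. The function name and timestamps are hypothetical.

```python
def late_broadcast(enter_times, root):
    """Per-process waiting time caused by a root that enters late."""
    root_enter = enter_times[root]
    return [
        0.0 if rank == root else max(0.0, root_enter - t)
        for rank, t in enumerate(enter_times)
    ]


# Hypothetical MPI_Bcast enter timestamps (seconds); rank 1 is the root.
enter = [1.0, 4.0, 2.0, 5.0]
waits = late_broadcast(enter, root=1)
print(waits)
```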

20 Early Reduce
  [Figure: time-line diagram (location vs. time) showing four MPI_Reduce calls, one of them the root]
  Waiting time if the destination process (root) of a collective N-to-1 communication operation enters the operation earlier than its sending counterparts
  Applies to: MPI_Reduce, MPI_Gather, MPI_Gatherv
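A rough sketch of this metric follows. The slide does not say which sender bounds the root's wait; the code assumes the wait ends when the first sender enters the operation, which is an assumption of this example and may not match Scalasca's actual rule. The function name and timestamps are hypothetical.

```python
def early_reduce(enter_times, root):
    """Waiting time of the root if it enters before any sender.

    Assumption: the wait ends when the first sender enters the operation.
    """
    first_sender = min(t for rank, t in enumerate(enter_times) if rank != root)
    return max(0.0, first_sender - enter_times[root])


# Hypothetical MPI_Reduce enter timestamps (seconds); rank 2 is the root.
enter = [3.0, 4.0, 1.0, 5.0]
root_wait = early_reduce(enter, root=2)
print(root_wait)
```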

21 Early Scan
  [Figure: time-line diagram (location vs. time) showing MPI_Scan calls on ranks 0 to 3]
  Waiting time if process n enters a prefix reduction operation earlier than its sending counterparts (i.e., ranks 0..n-1)
  Applies to: MPI_Scan

22 MPI point-to-point communication time
  MPI
    Synchronization
    Communication
      Point-to-point
        Late Sender
          Msg. in Wrong Order
            Same Source
            Different Source
        Late Receiver
      Collective

23 Late Sender
  [Figure: two time-line diagrams (location vs. time); the first shows MPI_Send matched by MPI_Recv and by MPI_Irecv/MPI_Wait, the second shows MPI_Isend/MPI_Wait matched by MPI_Recv and by MPI_Irecv/MPI_Wait]
  Waiting time caused by a blocking receive operation posted earlier than the corresponding send operation
  Applies to blocking as well as non-blocking communication
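In its simplest form, the Late Sender waiting time is the gap between the receiver entering its blocking receive (or wait) call and the sender entering the matching send. The sketch below uses hypothetical names and timestamps and ignores refinements such as capping the wait at the duration of the receive call.

```python
def late_sender(send_enter, recv_enter):
    """Receiver-side waiting time when the send is posted after the receive."""
    return max(0.0, send_enter - recv_enter)


# Hypothetical timestamps (seconds).
print(late_sender(send_enter=5.0, recv_enter=2.0))  # receiver waits
print(late_sender(send_enter=1.0, recv_enter=2.0))  # sender was early: no wait
```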

24 Late Sender (II)
  [Figure: time-line diagram (location vs. time) showing two MPI_Send calls matched by MPI_Irecv and MPI_Waitall]
  When waiting for several messages, the maximum waiting time is counted
  Applies to: MPI_Waitall, MPI_Waitsome
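For a call that waits on several pending receives, the rule on this slide can be sketched by computing the Late Sender time for each message and keeping only the maximum. The function name and timestamps are hypothetical.

```python
def late_sender_waitall(wait_enter, send_enters):
    """Maximum Late Sender time over all messages completed by one wait call."""
    return max((max(0.0, s - wait_enter) for s in send_enters), default=0.0)


# Hypothetical timestamps (seconds): the wait call is entered at t=2.0 and
# the three matching sends are entered at t=3.0, 6.0 and 1.0.
wait = late_sender_waitall(2.0, [3.0, 6.0, 1.0])
print(wait)
```

Summing the individual waits instead would count the overlapping waiting periods more than once, which is presumably why the maximum is used.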

25 Late Sender, Messages in Wrong Order
  [Figure: time-line diagram (location vs. time) showing four MPI_Send and four MPI_Recv calls]
  Refers to Late Sender situations which are caused by messages received in the wrong order
  Comes in two flavours:
    Messages sent from same source location
    Messages sent from different source locations

26 Late Receiver
  [Figure: time-line diagram (location vs. time) showing MPI_Send calls matched by MPI_Recv and by MPI_Irecv/MPI_Wait]
  Waiting time caused by a blocking send operation posted earlier than the corresponding receive operation
  Calculated by the receiver, but the waiting time is attributed to the sender
  Currently does not apply to non-blocking sends
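The sketch below mirrors the two points on this slide: the waiting time is the gap between a blocking send and a later receive, and the result is charged to the sender even though the receiver side computes it. The function name, the dictionary keys, and the rank numbers are hypothetical.

```python
def late_receiver(send_enter, recv_enter, sender, receiver):
    """Sender-side waiting time when the receive is posted after the send."""
    wait = max(0.0, recv_enter - send_enter)
    # Computed on the receiver side, but attributed to the sender's location.
    return {"computed_by": receiver, "attributed_to": sender, "wait": wait}


# Hypothetical timestamps (seconds): rank 0 sends at t=1.0, rank 1 receives at t=4.0.
result = late_receiver(send_enter=1.0, recv_enter=4.0, sender=0, receiver=1)
print(result)
```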

27 Late Sender/Receiver Counts
  The numbers of Late Sender / Late Receiver instances are also available
  They are divided into communications & synchronizations and shown in the corresponding hierarchies

28 OpenMP-related metrics (as produced by Scalasca 1.2 runtime summarization and trace analysis for hybrid MPI/OpenMP apps)

29 OpenMP Time hierarchy
  Time
    Execution
      MPI
      OMP
        Flush
        Management
          Fork
        Synchronization
    Overhead
    Idle Threads
      Limited parallelism

30 OpenMP Time hierarchy details
  OMP: Time spent for OpenMP-related tasks
  Flush: Time spent in OpenMP flush directives
  Synchronization: Time spent to synchronize OpenMP threads

31 OpenMP Management Time
  [Figure: time-line diagram (location vs. time) alternating serial phases and parallel region bodies]
  Time spent on master thread for creating/destroying OpenMP thread teams

32 OpenMP Fork Time
  [Figure: time-line diagram (location vs. time) alternating serial phases and parallel region bodies]
  Time spent on master threads for creating OpenMP thread teams

33 OpenMP Idle Threads
  [Figure: time-line diagram (location vs. time) alternating serial phases and parallel region bodies]
  Time spent idle on CPUs reserved for worker threads

34 OpenMP Limited Parallelism
  [Figure: time-line diagram (location vs. time) alternating serial phases and parallel region bodies]
  Time spent idle on worker threads within parallel regions
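The relationship between Idle Threads and Limited Parallelism can be sketched as follows. The example assumes a simplified timeline of phases, each with a duration and a number of busy threads, and treats Limited Parallelism as the part of the idle time that falls inside parallel regions; the phase labels, the function name, and the numbers are all hypothetical.

```python
def idle_thread_time(phases, team_size):
    """Return (idle threads time, limited parallelism time).

    phases: list of (kind, duration, active_threads), where kind is
            "serial" or "parallel" (hypothetical labels).
    """
    idle = 0.0
    limited = 0.0
    for kind, duration, active in phases:
        unused = (team_size - active) * duration  # idle CPU time in this phase
        idle += unused
        if kind == "parallel":
            limited += unused  # idle workers inside a parallel region
    return idle, limited


# Hypothetical timeline for a team of 4 threads.
phases = [
    ("serial", 2.0, 1),
    ("parallel", 4.0, 4),
    ("serial", 1.0, 1),
    ("parallel", 3.0, 2),
]
idle, limited = idle_thread_time(phases, team_size=4)
print(idle, limited)
```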

35 OpenMP Synchronization Time hierarchy
  OMP
    Flush
    Management
    Synchronization
      Barrier
        Explicit
        Implicit
      Critical
      Lock API
  Time spent in OpenMP atomic constructs is attributed to the Critical metric

36 OpenMP-related metrics (as produced by Scalasca 1.2 trace analysis for pure OpenMP apps)

37 OpenMP Time hierarchy
  Time
    Execution
      MPI
      OpenMP
        Synchronization
        Fork
        Flush
    Idle Threads
    Overhead

38 OpenMP Time hierarchy details
  OpenMP: Time spent for OpenMP-related tasks
  Synchronization: Time spent for synchronizing OpenMP threads
  Fork: Time spent by master thread to create thread teams
  Flush: Time spent in OpenMP flush directives
  Idle Threads: Time spent idle on CPUs reserved for worker threads

39 OpenMP synchronization time
  OpenMP
    Synchronization
      Barrier
        Explicit
          Wait at Barrier
        Implicit
          Wait at Barrier
      Lock Competition
        API
        Critical

40 Wait at Barrier
  [Figure: time-line diagram (location vs. time) showing four OpenMP barrier calls]
  Time spent waiting in front of a barrier call until the last thread reaches the barrier operation
  Applies to: Implicit/explicit barriers

41 Lock Competition
  [Figure: time-line diagram (location vs. time) showing two threads, each acquiring and releasing a lock]
  Time spent waiting for a lock that has been previously acquired by another thread
  Applies to: critical sections, OpenMP lock API
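A minimal sketch of this metric: a thread that tries to acquire a lock while another thread still holds it waits until that thread releases the lock. The function name and timestamps are hypothetical, and the example considers only a single competing holder.

```python
def lock_competition(acquire_enter, previous_release):
    """Waiting time for a lock that another thread still holds."""
    return max(0.0, previous_release - acquire_enter)


# Hypothetical timestamps (seconds): the waiting thread requests the lock at
# t=2.0, and the previous holder releases it at t=5.0.
print(lock_competition(acquire_enter=2.0, previous_release=5.0))
# The lock was already free when requested: no waiting time.
print(lock_competition(acquire_enter=6.0, previous_release=5.0))
```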

Markus Geimer, m.geimer@fz-juelich.de