Scalasca performance properties: The metrics tour
Slide 1: Scalasca performance properties: The metrics tour
- Markus Geimer
Slide 2: Scalasca analysis result
Slide 3: Generic metrics
Slide 4: Generic metrics
- Time: Total CPU allocation time
  - Execution: Execution time without overhead
  - Overhead: Time spent in tasks related to measurement (does not include per-function perturbation!)
- Visits: Number of times a function/region was executed
- Hardware counters: Aggregated counter values for each function/region
- Comp. Imbalance: Simple load imbalance heuristic
Slide 5: Computational imbalance
- Simple load imbalance heuristic
  - Focuses only on computational parts
  - Easy to calculate: absolute difference to the average exclusive execution time
- Captures global imbalances
  - Based on the entire measurement
  - Does not compare individual instances of function calls
- High value = imbalance in the sub-calltree underneath
  - Expand the subtree to find the real location of the imbalance
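As a hedged formalization of the heuristic (notation mine, not from the slides): for a call path c with exclusive execution times t_p(c) on processes p = 1, ..., P, each process is compared against the average:

```latex
\bar{t}(c) = \frac{1}{P}\sum_{p=1}^{P} t_p(c),
\qquad
\mathrm{imbalance}_p(c) = \bigl|\, t_p(c) - \bar{t}(c) \,\bigr|
```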
Slide 6: MPI-related metrics
Slide 7: MPI Time hierarchy
- MPI
  - Synchronization
    - Collective
    - RMA
      - Active Target
      - Passive Target
  - Communication
    - Point-to-point
    - Collective
    - RMA
  - File I/O
    - Collective
  - Init/Exit
Slide 8: MPI Time hierarchy details
- MPI: Time spent in pre-instrumented MPI functions
- Synchronization: Time spent in calls to MPI_Barrier or Remote Memory Access synchronization calls
- Communication: Time spent in MPI communication calls, subdivided into collective, point-to-point, and RMA
- File I/O: Time spent in MPI file I/O functions, with specialization for collective I/O calls
- Init/Exit: Time spent in MPI_Init and MPI_Finalize
Slide 9: MPI Synchronizations hierarchy
- Synchronizations
  - Point-to-point
    - Sends
    - Receives
  - Collective
  - RMA
    - Fences
    - GATS (general active target synchronization) Epochs
      - Access Epochs
      - Exposure Epochs
    - Locks
- Provides the number of calls to an MPI synchronization function of the corresponding class
- Synchronizations include zero-sized message transfers!
Slide 10: MPI Communications hierarchy
- Communications
  - Point-to-point
    - Sends
    - Receives
  - Collective
    - Exchange
    - As source
    - As destination
  - RMA
    - Puts
    - Gets
- Provides the number of calls to an MPI communication function of the corresponding class
- Zero-sized message transfers are considered synchronization!
Slide 11: MPI Transfer hierarchy
- Bytes transferred
  - Point-to-point
    - Sent
    - Received
  - Collective
    - Outgoing
    - Incoming
  - RMA
    - Sent
    - Received
- Provides the number of bytes transferred by an MPI communication function of the corresponding class
Slide 12: MPI File operations hierarchy
- MPI file operations
  - Individual
    - Reads
    - Writes
  - Collective
    - Reads
    - Writes
- Provides the number of calls to MPI file I/O functions of the corresponding class
Slide 13: MPI collective synchronization time
- MPI
  - Synchronization
    - Collective
      - Wait at Barrier
      - Barrier Completion
    - RMA
  - Communication
  - File I/O
  - Init/Exit
Slide 14: Wait at Barrier
[Time-line diagram (location vs. time): four processes entering MPI_Barrier at different times]
- Time spent waiting in front of a barrier call until the last process reaches the barrier operation
- Applies to: MPI_Barrier
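A minimal sketch (example mine, not from the slides) of a program that typically produces Wait at Barrier time, using POSIX sleep as stand-in work:

```c
#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sleep((unsigned)rank);        /* imbalanced "work": rank r computes for r seconds */
    MPI_Barrier(MPI_COMM_WORLD);  /* every rank except the slowest waits here */

    MPI_Finalize();
    return 0;
}
```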
Slide 15: Barrier Completion
[Time-line diagram: four processes leaving MPI_Barrier at different times]
- Time spent in the barrier after the first process has left the operation
- Applies to: MPI_Barrier
Slide 16: MPI collective communication time
- MPI
  - Synchronization
  - Communication
    - Point-to-point
    - Collective
      - Early Reduce
      - Early Scan
      - Late Broadcast
      - Wait at N x N
      - N x N Completion
    - RMA
Slide 17: Wait at N x N
[Time-line diagram: four processes entering MPI_Allreduce at different times]
- Time spent waiting in front of a synchronizing collective operation call until the last process reaches the operation
- Applies to: MPI_Allgather, MPI_Allgatherv, MPI_Allreduce, MPI_Alltoall, MPI_Reduce_scatter, MPI_Reduce_scatter_block
Slide 18: N x N Completion
[Time-line diagram: four processes leaving MPI_Allreduce at different times]
- Time spent in synchronizing collective operations after the first process has left the operation
- Applies to: MPI_Allgather, MPI_Allgatherv, MPI_Allreduce, MPI_Alltoall, MPI_Reduce_scatter, MPI_Reduce_scatter_block
Slide 19: Late Broadcast
[Time-line diagram: MPI_Bcast on four processes, with the root entering last]
- Waiting time if the destination processes of a collective 1-to-N communication operation enter the operation earlier than the source process (root)
- Applies to: MPI_Bcast, MPI_Scatter, MPI_Scatterv
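A hedged sketch of a Late Broadcast situation (example mine): the root is delayed before entering MPI_Bcast, so the destination ranks wait inside the call:

```c
#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, data = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        sleep(2);                                     /* root (source) enters last */
    MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* non-roots accumulate Late Broadcast time */

    MPI_Finalize();
    return 0;
}
```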
Slide 20: Early Reduce
[Time-line diagram: MPI_Reduce on four processes, with the root entering first]
- Waiting time if the destination process (root) of a collective N-to-1 communication operation enters the operation earlier than its sending counterparts
- Applies to: MPI_Reduce, MPI_Gather, MPI_Gatherv
Slide 21: Early Scan
[Time-line diagram: MPI_Scan on ranks 0-3]
- Waiting time if process n enters a prefix reduction operation earlier than its sending counterparts (i.e., ranks 0..n-1)
- Applies to: MPI_Scan, MPI_Exscan
Slide 22: MPI point-to-point communication time
- MPI
  - Synchronization
  - Communication
    - Point-to-point
      - Late Sender
        - Messages in Wrong Order
          - Same Source
          - Different Source
      - Late Receiver
    - Collective
    - RMA
Slide 23: Late Sender
[Two time-line diagrams: a receive (MPI_Recv, or MPI_Irecv plus MPI_Wait) posted before the matching MPI_Send or MPI_Isend/MPI_Wait]
- Waiting time caused by a blocking receive operation posted earlier than the corresponding send operation
- Applies to blocking as well as non-blocking communication
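A minimal Late Sender reproducer (example mine; run with at least two ranks): the receive is posted well before the matching send, so the receiver's time in MPI_Recv would be classified as Late Sender:

```c
#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, msg = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        sleep(2);                                    /* sender starts late */
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* blocks for ~2 s: this is the Late Sender waiting time */
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```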
Slide 24: Late Sender (II)
[Time-line diagram: MPI_Irecv/MPI_Waitall waiting for two MPI_Send operations]
- While waiting for several messages, the maximum waiting time is accounted
- Applies to: MPI_Waitall, MPI_Waitsome
Slide 25: Late Sender, Messages in Wrong Order
[Time-line diagram: four sends received in a different order than posted]
- Refers to Late Sender situations which are caused by messages received in wrong order
- Comes in two flavours:
  - Messages sent from the same source location
  - Messages sent from different source locations
Slide 26: Late Receiver
[Time-line diagram: MPI_Send blocking until the matching MPI_Irecv/MPI_Wait is posted]
- Waiting time caused by a blocking send operation posted earlier than the corresponding receive operation
- Calculated by the receiver, but the waiting time is attributed to the sender
- Currently does not apply to non-blocking sends
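Conversely, a hedged Late Receiver sketch (example mine): MPI_Ssend is used here to force the send to block until the receive is posted, since a standard MPI_Send may simply buffer a small message:

```c
#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, msg = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* synchronous send cannot complete before the receive is posted */
        MPI_Ssend(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        sleep(2);  /* receiver starts late; the waiting time is charged to the sender */
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```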
Slide 27: Late Sender/Receiver counts
- The number of Late Sender / Late Receiver instances is also available
- They are divided into communications & synchronizations and shown in the corresponding hierarchies
Slide 28: MPI RMA synchronization time
- MPI
  - Synchronization
    - Collective
    - RMA
      - Active Target
        - Late Post
        - Wait at Fence
          - Early Fence
        - Early Wait
          - Late Complete
      - Passive Target
Slide 29: Late Post
[Two time-line diagrams: origin calling MPI_Win_start/MPI_Put/MPI_Win_complete while the target opens its exposure epoch late with MPI_Win_post/MPI_Win_wait]
- MPI_Win_start (top) or MPI_Win_complete (bottom) wait until the exposure epoch is opened by MPI_Win_post
- Which of the two calls blocks is implementation dependent
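A hedged two-rank sketch of a Late Post situation (example mine; run with exactly 2 ranks): the target delays MPI_Win_post, so the origin blocks in MPI_Win_start or MPI_Win_complete, depending on the MPI implementation:

```c
#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, buf = 0, value = 42;
    MPI_Win win;
    MPI_Group world, peer;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_group(MPI_COMM_WORLD, &world);
    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    if (rank == 0) {                      /* origin */
        int target = 1;
        MPI_Group_incl(world, 1, &target, &peer);
        MPI_Win_start(peer, 0, win);      /* may block until rank 1 posts... */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_complete(win);            /* ...or the waiting shows up here */
        MPI_Group_free(&peer);
    } else if (rank == 1) {               /* target */
        int origin = 0;
        MPI_Group_incl(world, 1, &origin, &peer);
        sleep(2);                         /* exposure epoch opened late */
        MPI_Win_post(peer, 0, win);
        MPI_Win_wait(win);
        MPI_Group_free(&peer);
    }

    MPI_Group_free(&world);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```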
Slide 30: Wait at Fence
[Time-line diagram: processes reaching a closing MPI_Win_fence at different times]
- Time spent waiting in front of a synchronizing MPI_Win_fence call until the last process reaches the fence operation
- Only triggered if at least one of the following conditions applies:
  - The given assertion is 0
  - All fence calls overlap (heuristic)
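A hedged Wait at Fence sketch (example mine): ranks reach the closing fence at different times, so early arrivals wait for the slowest rank:

```c
#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, buf[64] = {0};              /* one target slot per rank; supports up to 64 ranks */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Win_create(buf, sizeof(buf), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                /* open epoch (assertion 0, so the pattern can trigger) */
    sleep((unsigned)rank);                /* imbalanced work inside the epoch */
    if (rank > 0)
        MPI_Put(&rank, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                /* early ranks wait here: Wait at Fence */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```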
Slide 31: Early Fence
[Time-line diagram: MPI_Win_fence waiting for an outstanding Put to the target to complete]
- Time spent waiting for the exit of the last RMA operation to the target location
- Sub-pattern of Wait at Fence
Slide 32: Early Wait
[Time-line diagram: target in MPI_Win_wait while origins are still inside their access epochs]
- Time spent in MPI_Win_wait until the access epoch is closed by the last MPI_Win_complete
Slide 33: Late Complete
[Time-line diagram: pause between the last Put and the closing MPI_Win_complete]
- Waiting time due to an unnecessary pause between the last RMA operation to the target and closing the access epoch with the last MPI_Win_complete
- Sub-pattern of Early Wait
Slide 34: MPI RMA communication time
- MPI
  - Synchronization
  - Communication
    - Point-to-point
    - Collective
    - RMA
      - Early Transfer
Slide 35: Early Transfer
[Time-line diagram: the origin's Put starting before the target's MPI_Win_post]
- Time spent waiting in an RMA operation on the origin(s) that was started before the exposure epoch was opened on the target
Slide 36: OpenMP-related metrics
Slide 37: OpenMP Time hierarchy
- Time
  - Execution
    - MPI
    - OMP
      - Flush
      - Management
        - Fork
      - Synchronization
  - Overhead
  - Idle Threads
    - Limited parallelism
Slide 38: OpenMP Time hierarchy details
- OMP: Time spent for OpenMP-related tasks
- Flush: Time spent in OpenMP flush directives
- Synchronization: Time spent to synchronize OpenMP threads
Slide 39: OpenMP Management Time
[Time-line diagram: serial phases alternating with parallel region bodies across the thread team]
- Time spent on the master thread for creating/destroying OpenMP thread teams
Slide 40: OpenMP Fork Time
[Time-line diagram: master thread creating thread teams at the start of each parallel region]
- Time spent on the master thread for creating OpenMP thread teams
Slide 41: OpenMP Idle Threads
[Time-line diagram: worker threads idle during the serial phases between parallel region bodies]
- Time spent idle on CPUs reserved for worker threads
Slide 42: OpenMP Limited Parallelism
[Time-line diagram: worker threads idle during parts of the parallel region bodies]
- Time spent idle on worker threads within parallel regions
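A hedged illustration of limited parallelism (example mine): with, say, OMP_NUM_THREADS=8, only two threads receive a section and the rest idle inside the region. Exactly how this idle time is attributed may differ between tool versions:

```c
#include <unistd.h>

int main(void)
{
    /* run with e.g. OMP_NUM_THREADS=8: only two threads get work */
    #pragma omp parallel sections
    {
        #pragma omp section
        sleep(2);   /* "heavy" task */
        #pragma omp section
        sleep(1);   /* "light" task; the remaining six threads idle */
    }
    return 0;
}
```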
Slide 43: OpenMP Synchronization Time hierarchy
- OMP
  - Flush
  - Management
  - Synchronization
    - Barrier
      - Explicit
      - Implicit
    - Critical
    - Lock API
- Time spent in OpenMP atomic constructs is attributed to the Critical metric
Slide 44: OpenMP-related metrics
(as produced by Scalasca's sequential trace analyzer for OpenMP and hybrid MPI/OpenMP applications)
Slide 45: OpenMP Time hierarchy
- Time
  - Execution
    - MPI
    - OpenMP
      - Synchronization
      - Fork
      - Flush
  - Idle Threads
  - Overhead
Slide 46: OpenMP Time hierarchy details
- OpenMP: Time spent for OpenMP-related tasks
- Synchronization: Time spent for synchronizing OpenMP threads
- Fork: Time spent by the master thread to create thread teams
- Flush: Time spent in OpenMP flush directives
- Idle Threads: Time spent idle on CPUs reserved for worker threads
Slide 47: OpenMP synchronization time
- OpenMP
  - Synchronization
    - Barrier
      - Explicit
        - Wait at Barrier
      - Implicit
        - Wait at Barrier
    - Lock Competition
      - API
      - Critical
Slide 48: Wait at Barrier
[Time-line diagram: four threads entering an OpenMP barrier at different times]
- Time spent waiting in front of a barrier call until the last thread reaches the barrier operation
- Applies to: implicit and explicit barriers
Slide 49: Lock Competition
[Time-line diagram: one thread holding a lock while another waits in its acquire call]
- Time spent waiting for a lock that has previously been acquired by another thread
- Applies to: critical sections, OpenMP lock API
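A hedged Lock Competition sketch (example mine): threads serialize on a critical section, so each one waits for the current owner to leave:

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int counter = 0;
    #pragma omp parallel
    {
        #pragma omp critical
        {
            counter++;        /* protected update */
            usleep(100000);   /* long critical section: other threads wait to acquire */
        }
    }
    printf("counter = %d\n", counter);
    return 0;
}
```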