Score-P. SC'14: Hands-on Practical Hybrid Parallel Application Performance Engineering


2 Score-P Functionality
Score-P is a joint instrumentation and measurement system for a number of performance analysis (PA) tools.
- Provide typical functionality for HPC performance tools
- Support all fundamental concepts of the partners' tools
- Instrumentation (various methods)
- Flexible measurement without re-compilation:
  - Basic and advanced profile generation
  - Event trace recording
  - Online access to profiling data
- MPI/SHMEM, OpenMP/Pthreads, and hybrid parallelism (and serial)
- Enhanced functionality (OpenMP 3.0, CUDA, highly scalable I/O)

3 Score-P Architecture
(Layered architecture diagram, from top to bottom:)
- Analysis tools: Vampir, Scalasca, CUBE, TAU, TAUdb, Periscope
- Outputs: event traces (OTF2), call-path profiles (CUBE4, TAU), online interface
- Score-P measurement infrastructure, with hardware counters (PAPI, rusage)
- Instrumentation wrapper:
  - Process-level parallelism (MPI, SHMEM)
  - Thread-level parallelism (OpenMP, Pthreads)
  - Accelerator-based parallelism (CUDA)
  - Source code instrumentation
  - User instrumentation
- Application
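The "User instrumentation" box in the diagram can be illustrated with a minimal sketch. It uses the Score-P user-region macros from scorep/SCOREP_User.h; the function sum_of_squares and the region name are hypothetical and chosen only for illustration. The fallback macro definitions are an assumption made here so that the file also compiles on a machine without Score-P installed.

```c
#include <assert.h>

/* When built through the Score-P compiler wrapper with user instrumentation
 * enabled, SCOREP_USER_ENABLE is defined and the real macros are used.
 * Otherwise the empty fallbacks below (an assumption of this sketch) let the
 * file compile as plain C. */
#ifdef SCOREP_USER_ENABLE
#include <scorep/SCOREP_User.h>
#else
#define SCOREP_USER_REGION_DEFINE(handle)
#define SCOREP_USER_REGION_BEGIN(handle, name, type)
#define SCOREP_USER_REGION_END(handle)
#endif

/* Hypothetical kernel: sums the squares 1^2 .. n^2. */
long sum_of_squares(int n)
{
    assert(n >= 0);

    /* Declares the handle that identifies the user region. */
    SCOREP_USER_REGION_DEFINE(region)
    long total = 0;

    /* Everything between BEGIN and END is recorded as one region. */
    SCOREP_USER_REGION_BEGIN(region, "sum_of_squares_loop",
                             SCOREP_USER_REGION_TYPE_COMMON)
    for (int i = 1; i <= n; ++i) {
        total += (long)i * i;
    }
    SCOREP_USER_REGION_END(region)

    return total;
}
```

Marking only the loop rather than the whole function is a design choice of this sketch: a user region is most useful where compiler instrumentation of whole functions is too coarse.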

4 CUBE

5 CUBE
- Parallel program analysis report exploration tools
  - GUI for interactive analysis exploration
  - Algebra utilities for report processing
- Originally developed as part of the Scalasca toolset
- Now available as a separate component
  - Can be installed independently of Score-P, e.g., on a laptop or desktop
- Latest release: CUBE (June 2014)

6 Analysis presentation and exploration
- Representation of values (severity matrix) on three hierarchical axes
  - Performance property (metric)
  - Call path (program location)
  - System location (process/thread)
- Three coupled tree browsers
- CUBE displays severities
  - As value: for precise comparison
  - As colour: for easy identification of hotspots
  - Inclusive value when closed & exclusive value when expanded
(Diagram: severity cube with the axes Property, Call path, and Location.)

7 Analysis presentation
- What kind of performance metric?
- Where is it in the source code? In what context?
- How is it distributed across the processes/threads?

8 Inclusive vs. Exclusive values
- Inclusive: information of all sub-elements aggregated into a single value
- Exclusive: information that cannot be subdivided further

    int foo()
    {
        int a;
        a = 1 + 1;
        bar();
        a = a + 1;
        return a;
    }

For foo(), the inclusive value covers the whole function body, including the call to bar(). The exclusive value covers only foo()'s own statements, without the time spent in bar().
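The relationship between the two values can be sketched in a few lines of C: the inclusive value of a call-tree node is its exclusive value plus the inclusive values of everything it calls. The struct call_node type and the function inclusive_time are hypothetical names for this illustration, not CUBE data structures.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* One call-tree node: its own (exclusive) time plus links to its callees. */
struct call_node {
    const char *name;
    double exclusive;                          /* time spent in the node itself */
    const struct call_node *children[MAX_CHILDREN];
    size_t num_children;
};

/* Inclusive value = exclusive value + inclusive values of all callees. */
double inclusive_time(const struct call_node *node)
{
    assert(node != NULL);
    assert(node->num_children <= MAX_CHILDREN);

    double total = node->exclusive;
    for (size_t i = 0; i < node->num_children; ++i) {
        total += inclusive_time(node->children[i]);
    }
    return total;
}
```

With foo calling bar, an exclusive time of 2.0 for foo and 3.0 for bar gives foo an inclusive time of 5.0, which is the value a collapsed foo node would show.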

9 SCALASCA

10 Automatic trace analysis
- Idea
  - Automatic search for patterns of inefficient behavior
  - Classification of behavior & quantification of significance
(Diagram: low-level event trace -> analysis -> high-level result, shown over the axes Property, Call path, and Location.)
- Guaranteed to cover the entire event trace
- Quicker than manual/visual trace analysis
- Parallel replay analysis exploits available memory & processors to deliver scalability of the analysis

11 Scalasca workflow
(Workflow diagram, in order:)
- Source modules pass through the instrumenter (compiler / linker), producing an instrumented executable
- The instrumented target application runs with the measurement library and hardware counters (HWC), using an optimized measurement configuration
- Measurement produces a summary report and local event traces
- Scalasca trace analysis runs a parallel wait-state search on the local event traces and produces a wait-state report
- Report manipulation, followed by report exploration: Which problem? Where in the program? Which process?

12 Example: Late Sender
- Waiting time caused by a blocking receive operation posted earlier than the corresponding send
- Applies to blocking as well as non-blocking communication
(Diagrams: time-versus-location timelines pairing MPI_Send with MPI_Recv, MPI_Send with MPI_Irecv/MPI_Wait, MPI_Isend/MPI_Wait with MPI_Recv, and MPI_Isend/MPI_Wait with MPI_Irecv/MPI_Wait.)
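The Late Sender pattern can be quantified from the timestamps of one matched message. The following is a simplified sketch under stated assumptions, not Scalasca's implementation: the struct p2p_event type and its fields are hypothetical, and the waiting time is taken as the span between the receiver entering its blocking call and the sender entering the send, capped by the time the receiver actually spent in that call.

```c
#include <assert.h>

/* Timestamps (in seconds) of one matched send/receive pair. */
struct p2p_event {
    double send_enter;   /* sender enters MPI_Send or MPI_Isend */
    double recv_enter;   /* receiver enters MPI_Recv or MPI_Wait */
    double recv_exit;    /* receiver leaves that call */
};

/* Returns the Late Sender waiting time for one message, or 0 if the
 * send was posted no later than the receive. */
double late_sender_wait(struct p2p_event ev)
{
    assert(ev.recv_exit >= ev.recv_enter);

    if (ev.send_enter <= ev.recv_enter) {
        return 0.0;                      /* sender was not late */
    }
    double wait = ev.send_enter - ev.recv_enter;
    double in_recv = ev.recv_exit - ev.recv_enter;
    /* The receiver cannot have waited longer than it was in the call. */
    return wait < in_recv ? wait : in_recv;
}
```

For non-blocking receives, the relevant receiver timestamps in this model are those of the MPI_Wait that completes the request, since that is where the process blocks.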

13 Example: Late Broadcast
- Waiting times if the destination processes of a collective 1-to-N operation enter the operation earlier than the source process (root)
- Applies to: MPI_Bcast, MPI_Scatter, MPI_Scatterv
(Diagram: time-versus-location timeline of four processes in MPI_Bcast, one of them the root.)
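As a rough illustration of the Late Broadcast definition, the sketch below sums, over all non-root processes, the time each one spent in the collective before the root entered it. The function late_broadcast_wait and its parameters are hypothetical, and the model considers enter timestamps only; it is a simplification, not the analysis Scalasca performs.

```c
#include <assert.h>
#include <stddef.h>

/* enter[rank] is the time at which each rank enters the 1-to-N collective.
 * Returns the total waiting time of all destination ranks that entered
 * before the root did. */
double late_broadcast_wait(const double *enter, size_t n, size_t root)
{
    assert(enter != NULL);
    assert(root < n);

    double total = 0.0;
    for (size_t rank = 0; rank < n; ++rank) {
        /* Only destinations that arrived before the root have to wait. */
        if (rank != root && enter[rank] < enter[root]) {
            total += enter[root] - enter[rank];
        }
    }
    return total;
}
```

With enter times {1.0, 4.0, 2.0, 5.0} and rank 1 as root, ranks 0 and 2 wait 3.0 and 2.0 seconds respectively, while rank 3 arrives after the root and does not wait.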

14 Post-processed trace analysis report
- Additional trace-based metrics in metric hierarchy

15 Vampir

16 Mission
- Visualization of dynamics of complex parallel processes
- Full details for arbitrary temporal and spatial levels
- Supplement to automatic analysis
- Client-server architecture
Typical questions that Vampir helps to answer:
- What happens in my application execution during a given time in a given process or thread?
- How do the communication patterns of my application execute on a real system?
- Are there any imbalances in computation, I/O or memory usage and how do they affect the parallel execution of my application?

17 The Main Displays of Vampir
- Timeline Charts: show application activities and communication along a time axis
  - Master Timeline
  - Process Timeline
  - Counter Data Timeline
  - Performance Radar
- Summary Charts: provide quantitative results for the currently selected time interval
  - Function Summary
  - Message Summary
  - Process Summary
  - Communication Matrix View
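What a summary chart computes for "the currently selected time interval" can be sketched as follows: each function visit is clipped to the interval, and the clipped durations are accumulated per function. The struct region_visit type and the function time_in_interval are hypothetical, and the visits are assumed to be flat (non-nested), so this is an illustration of the idea rather than Vampir's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* One visit of a function on a single process, with enter/exit timestamps. */
struct region_visit {
    int function_id;
    double enter;
    double exit;
};

/* Accumulated time of one function inside the interval [begin, end]. */
double time_in_interval(const struct region_visit *visits, size_t n,
                        int function_id, double begin, double end)
{
    assert(visits != NULL || n == 0);
    assert(begin <= end);

    double total = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (visits[i].function_id != function_id) {
            continue;
        }
        /* Clip the visit to the selected interval. */
        double lo = visits[i].enter > begin ? visits[i].enter : begin;
        double hi = visits[i].exit < end ? visits[i].exit : end;
        if (hi > lo) {
            total += hi - lo;
        }
    }
    return total;
}
```

Zooming in a timeline chart corresponds to changing begin and end here, which is why the summary values change with the selection.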

18 Vampir: Visualization of the NPB-MZ-MPI / BT trace
(Screenshot with labelled areas: Navigation Toolbar, Function Summary, Master Timeline, Function Legend.)

19 Vampir: Visualization of the NPB-MZ-MPI / BT trace
- Process Timeline: detailed information about different levels of function calls in a stacked bar chart for an individual process.

20 Vampir: Visualization of the NPB-MZ-MPI / BT trace
- Counter Data Timeline: detailed counter information over time for an individual process.

21 Vampir: Visualization of the NPB-MZ-MPI / BT trace
- Performance Radar: detailed counter information over time for a collection of processes.

22 Vampir: Visualization of the NPB-MZ-MPI / BT trace
- Zoom in: Initialization Phase
- Context View: detailed information about function initialize_.

23 Vampir: Visualization of the NPB-MZ-MPI / BT trace
- Function Summary: overview of the accumulated information across all functions and for a collection of processes.
- Process Summary: overview of the accumulated information across all functions and for every process independently.
