Introduction to the Intel tracing tools

R. Bader (LRZ), M. Deilmann (Intel)

Help - my parallel program doesn't scale!

The aim is to isolate performance issues in parallel programs, especially those in large MPI programs:
  inefficient MPI programming
  bottlenecks
  latency-dominated performance
  load imbalance
  deadlock (hmmm ...)
but also issues that show up via subroutine time consumption.

Existing facilities - gprof for subroutines, the PMPI profiling interface (part of the MPI standard!) plus upshot - have problems with large parallel programs: the trace files cannot be easily analyzed.

Intel Tracing Tools

The Intel tracing tools solve the above problem by integrating subroutine and MPI profiling. They were originally developed by the Zentralinstitut für Mathematik at FZ Jülich, then marketed by Pallas, and are now an Intel product.

Platforms: x86 (Solaris & Linux); IPF and EM64T (Linux). Future development on non-Intel platforms is unclear - see also Vampir NG (FZ Jülich, ZIH Dresden). Supercomputers: Altix 3700 and 4700, NEC SX-4/5/6/8, CRAY.

Versions installed: at LRZ (campus license) GUI version 7.1 and tracing library version 7.1; at RRZE the tracing library is not available - trace libraries are rather expensive.

Documentation and resources

Intel web site - note the name mapping: VAMPIR is now the Intel Trace Analyzer, VAMPIRtrace is now the Intel Trace Collector.

LRZ web site - see especially the links to the user's guides at the end of this document, as well as the specific usage instructions for each platform the tracing libraries are available on.

Basic usage of tracing libraries and GUI

Two Components of ITT

1. Instrumentation
MPI library calls are diverted to the profiling interface. This is achieved by specifying the -vtrace switch on the mpif90/mpicc commands - an LRZ-specific setting that makes things work uniformly across platforms (with Intel MPI, use -t=log instead). In most cases relinking is sufficient. Additional functionality - subroutine tracing and switching tracing on and off - requires source code changes and recompilation.

2. Visualization
During/after the program run a trace file in STF (Structured Trace File) format is written to disk; this is still a lot of data. It may be visualized on any platform using the ITA GUI: many views of the trace data are available via a transparent menu structure, and hundreds of MPI processes are viewable with good performance.

Step 1: Instrument your Code

Example: the MPI heat conduction example. It uses non-blocking sends and receives as well as reduction operations, and sets up MPI data types and a topology (MPI_CART_CREATE); a minimal sketch of such a program is shown below.

Set up the environment:
  module load mpi_tracing
Recompile the code (for inclusion of subroutine tracing):
  mpif90 -vtrace -c <lots_of_options> <source_file>
Relink the binary (heat_mpi.exe is the name of our application):
  mpif90 -vtrace -o heat_mpi.exe <lots_of_objects>

Warning: do not specify MPI libraries at your own discretion - this ruins the link sequence set up by -vtrace!
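The heat conduction code itself is not reproduced here. The following is only a sketch (an assumption, not the LRZ example) of a program using the same MPI features - a Cartesian topology, non-blocking halo exchange, MPI_WAITALL and a reduction - so you can see what ends up in the trace after relinking with -vtrace:

  ! heat_sketch.f90 - illustrative only; compile and link with mpif90 -vtrace
  program heat_sketch
    use mpi
    implicit none
    integer, parameter :: n = 1000                 ! local grid points per process
    integer :: ierr, rank, nprocs, comm_cart, left, right, iter, i
    integer :: requests(4), statuses(MPI_STATUS_SIZE, 4)
    double precision :: u(0:n+1), unew(1:n), diff, gdiff

    call MPI_INIT(ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
    ! 1-D periodic process topology (MPI_CART_CREATE as in the example)
    call MPI_CART_CREATE(MPI_COMM_WORLD, 1, (/ nprocs /), (/ .true. /), &
                         .true., comm_cart, ierr)
    call MPI_COMM_RANK(comm_cart, rank, ierr)
    call MPI_CART_SHIFT(comm_cart, 0, 1, left, right, ierr)

    u = 0.0d0
    u(1:n) = dble(rank)                            ! arbitrary initial data
    do iter = 1, 100
       ! halo exchange with non-blocking sends and receives
       call MPI_IRECV(u(0),   1, MPI_DOUBLE_PRECISION, left,  1, comm_cart, requests(1), ierr)
       call MPI_IRECV(u(n+1), 1, MPI_DOUBLE_PRECISION, right, 2, comm_cart, requests(2), ierr)
       call MPI_ISEND(u(n),   1, MPI_DOUBLE_PRECISION, right, 1, comm_cart, requests(3), ierr)
       call MPI_ISEND(u(1),   1, MPI_DOUBLE_PRECISION, left,  2, comm_cart, requests(4), ierr)
       call MPI_WAITALL(4, requests, statuses, ierr)
       ! Jacobi-style update and local change
       diff = 0.0d0
       do i = 1, n
          unew(i) = 0.5d0 * (u(i-1) + u(i+1))
          diff = max(diff, abs(unew(i) - u(i)))
       end do
       u(1:n) = unew
       ! global convergence measure via a reduction
       call MPI_ALLREDUCE(diff, gdiff, 1, MPI_DOUBLE_PRECISION, MPI_MAX, comm_cart, ierr)
    end do
    if (rank == 0) print *, 'final residual: ', gdiff
    call MPI_FINALIZE(ierr)
  end program heat_sketch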

Step 2: Prepare the Configuration File

Edit a file vt_heat to contain the following lines:

  # Log file
  LOGFILE-NAME heat.stf
  LOGFILE-FORMAT STF
  # disable all MPI activity
  ACTIVITY MPI OFF
  # enable the waits, sends, receives and collectives used by the example
  SYMBOL MPI_WAITALL ON
  SYMBOL MPI_IRECV ON
  SYMBOL MPI_ISEND ON
  SYMBOL MPI_BARRIER ON
  SYMBOL MPI_ALLREDUCE ON
  # enable all activities in the Application class
  ACTIVITY Application ON

Step 3: Run the Application

Set up the environment (if not yet done):
  module load mpi_tracing
  export VT_CONFIG=vt_heat
Run the executable:
  mpirun -np 4 ./heat_mpi.exe

Upon completion you should get the message
  [0] Intel Trace Collector INFO: Writing tracefile heat.stf in /home/cluster/a2832ba/kurse/parprog_2005/vt/heat
and a number of files heat.stf* should exist. To save disk space, convert to the STFSINGLE format:
  stftool heat.stf --convert - --logfile-format STFSINGLE | gzip -c > heat.stfsingle.gz
Do not use this format for viewing!

Step 4: Start the GUI

X11R5 or higher is needed - is X authentication working (use ssh -X ...)? Now enter
  traceanalyzer pmatmul.stf
The Qt-based Trace Analyzer GUI should now start up, showing the following widgets: the main window and, inside the main window, a panel referring to the presently analyzed trace file. There can be more than one panel/trace file in the main window at any time.

And this is what things look like at start: the chart shown is the Function Profile, in its Flat Profile tab.

Overall load balance: the times shown are inclusive times of the traced subroutine calls.

Resolving the MPI calls: right-click and select Ungroup Group MPI. You can sort by category; times are shown for each MPI routine specified in the configuration file.

Call tree. Notes: the call tree can only resolve activities that were switched on for tracing. By default, user code is not resolvable unless automatic or manual subroutine tracing is compiled in (automatic subroutine tracing is not yet available on Intel).

Call tree split up into user processes: select Children of Group All_Processes. You can also select expansion into various groups, at least two of which are always defined: MPI and Application (click on the dotted circle).

... call up the function group editor. Select, e.g., one of the major function groups and press OK; you will then essentially filter out all other groups from the view ...

... like so:

Now a different view: the timeline of all MPI processes. Select Event Timeline from the Charts menu; this opens an additional pane in the tracing subwindow. Then zoom in to some region of interest using the left mouse button, and do this repeatedly to obtain ...

... this small, second trace section. Red parts (in different shades) are the various MPI activities; blue parts are user code; black lines are communication - left-click for the context menu and choose Message Properties to obtain further information about a message. Note that the profile window (below) adjusts itself to the selected time slice!

Next chart option: the Qualitative Timeline. This gives co-synchronous information for, e.g., transfer rates; other quantities (transfer duration, transfer volume) are selectable via the context menu. The selectable events can be function events, messages (as here), or collective operations.

Quantitative Timeline: an accumulated measure of activities. This answers the question: how many CPUs are presently engaged in each activity? Example: the yellow arrow shows an interval where the application does communication exclusively. Note: you can remove activities via the context menu if the display is too cluttered.

The final chart type: the Message Profile. It gives you metrics for message volume, throughput, time and count, with senders (horizontal) vs. receivers (vertical). Note: the color codes enable you to easily find sore points; non-dense communication patterns are good.

Some advanced features: the Vampirtrace API, controlling tracefile output, and MPI message checking.

User-Level Instrumentation: The ITC API (1)

A subroutine library lets you control the profiling process from within the program, define your own activity classes and their members, and define performance counters (not discussed here).

Include files: VT.inc for Fortran, VT.h for C. The Fortran calls differ from the C calls; this presentation refers to the Fortran calls - see the user's guide for further information. The VT API has changed over time, so existing instrumentation may need updating for newer releases.

User-Level Instrumentation: The ITC API (2)

Initialize / finalize:
VTINIT(ierr) is automatically called from MPI_INIT(...); for tracing of non-MPI programs an explicit call is required (use libvtsp.a if available). VTFINI(ierr) is called from MPI_FINALIZE(...). All handle and error arguments are default integers.

Control:
VTTRACEON() and VTTRACEOFF() switch tracing on and off; VTFLUSH(ierr) writes the memory buffers to the flush file.

User-defined states:
Group subroutine calls into a class of activities ...
  call VTCLASSDEF('mylib', mylib_handle, ierr)
... and then add symbols for each state (subroutine?):
  call VTFUNCDEF('mystate1', mylib_handle, mystate1_handle, ierr)
(etc.) ... until the group is complete.

User-Level Instrumentation: The ITC API (3)

Actual measurement: start with
  call VTBEGIN(mystate1_handle, ierr)
and stop with
  call VTEND(mystate1_handle, ierr)
Nested calls are possible ... but no overlaps! Note that name_handle must be a global variable.

Skeleton of an instrumented program:

  program prog
    include 'VT.inc'
    ...
    [ call vttraceoff() ]
    call mpi_init(...)
    call vtclassdef('mylib', mylib_handle, ierr)
    call vtfuncdef('name', mylib_handle, name_handle, ierr)
    [ call vttraceon(ierr) ]
    ...

  subroutine name(...)
    include 'VT.inc'
    (declarations)
    call vtbegin(name_handle, ierr)
    (executable statements)
    call vtend(name_handle, ierr)
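For convenience, the fragments above can be combined into one small self-contained program. This is only a sketch under the assumption that VT.inc and the tracing library are available; the class and state names ('mylib', 'solve') are freely chosen examples, not part of the course material:

  module vt_handles                      ! keeps the handles global, as required
    implicit none
    integer :: mylib_handle, solve_handle
  end module vt_handles

  program vt_api_sketch
    use mpi
    use vt_handles
    implicit none
    include 'VT.inc'
    integer :: ierr
    call MPI_INIT(ierr)                                        ! implicitly calls VTINIT
    call VTCLASSDEF('mylib', mylib_handle, ierr)               ! define activity class
    call VTFUNCDEF('solve', mylib_handle, solve_handle, ierr)  ! define state within it
    call solve()
    call MPI_FINALIZE(ierr)                                    ! implicitly calls VTFINI
  end program vt_api_sketch

  subroutine solve()
    use vt_handles
    implicit none
    include 'VT.inc'
    integer :: ierr
    call VTBEGIN(solve_handle, ierr)     ! enter the traced state
    ! ... the work to be measured goes here ...
    call VTEND(solve_handle, ierr)       ! leave the traced state
  end subroutine solve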

Controlling tracefile output (1)

Environment variables:
  VT_CONFIG: name of the configuration file
  VT_CONFIG_RANK: rank of the MPI process reading the configuration file and writing the trace file (default is MPI process 0)
Always specify the name and format of the log file in the configuration file! The rank of the process writing the trace file can also be changed via a LOGFILE-RANK <number> entry in the configuration file.

Tracefile production: trace data are stored in memory buffers, controlled by the configuration entries
  MEM-BLOCKSIZE: size of the buffers (default 64 kByte)
  MEM-MAXBLOCKS: maximum number of memory buffers (default 0 = unlimited)
VT should not use too much memory (it may disrupt the application!), yet exhaustion of the memory buffers can happen after some time. What to do? And what about applications which hang or crash?

Controlling tracefile output (2)

Coping with buffer overrun:
a) Flush data to disk: the default action (AUTOFLUSH is on); possibly use a suitable MEM-FLUSHBLOCKS value to trigger background flushing.
b) Overwrite from the beginning: only the last part of the trace is written (AUTOFLUSH off, MEM-OVERWRITE on).
c) Stop trace collection: only the first part of the trace is written (AUTOFLUSH off, MEM-OVERWRITE off - the default).

Coping with tracefile size: long runs may produce tens (or hundreds) of GBytes! STF reduces the difficulties with visualizing this (frame definition, not discussed here). To reduce the size:
  insert VTtraceon() and VTtraceoff() calls,
  use activity/symbol filtering via the configuration file (e.g., no administrative MPI calls),
  use option b) or c) above,
  or trace only a subset of processes (PROCESS 0:N off, PROCESS 0:N:3 on).
A configuration sketch combining some of these settings is shown below.
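As an illustration, a VT_CONFIG file combining several of the settings from this and the previous slide might look as follows; the numeric values are only examples and would have to be adapted:

  # log file name/format and the rank that writes it
  LOGFILE-NAME heat.stf
  LOGFILE-FORMAT STF
  LOGFILE-RANK 0
  # buffer sizing (MEM-BLOCKSIZE default is 64 kByte, MEM-MAXBLOCKS default 0 = unlimited)
  MEM-BLOCKSIZE 262144
  MEM-MAXBLOCKS 256
  # variant c): neither flush nor overwrite - keep only the first part of the trace
  AUTOFLUSH off
  MEM-OVERWRITE off
  # trace only every third process
  PROCESS 0:N off
  PROCESS 0:N:3 on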

Controlling tracefile output (3)

Treatment of crashing applications: in this case, replace the -vtrace switch by -vtrace_fs ("failsafe tracing", a replacement of the tracing library), presently available only for Fortran on the Altix.

Failures handled: signals (SIGINT, SIGTERM; SIGKILL is not caught), premature exit of processes without MPI_Finalize, and MPI errors (communication problems, wrong parameters).

What is done? The MPI processes are frozen, the writing is done via TCP sockets, and SIGINT is sent after the data have been written.

Controlling tracefile output (4)

Treatment of hanging applications: deadlock detection is performed automatically. If ITC observes no progress in any process for a certain amount of time, it assumes a deadlock, stops the application and writes a trace file. The timeout is configurable via DEADLOCK-TIMEOUT (default 5 minutes); "no progress" is defined as remaining inside the same MPI call. Obviously this is just a heuristic approach and may fail: if all processes remain in MPI for a long time, e.g. due to a long data transfer, the timeout might be reached and lead to a premature abort.

Further functionality

Recording source locations enables you to dive into the source from a message line or function in the GUI; this has a potentially very large performance overhead and is presently available only for GCC-based MPI.

Automatic subroutine tracing: the Application class can then be ungrouped and no manual instrumentation is needed. Use the -tcollect switch in addition to -vtrace; the Intel compiler 10.0 or higher is needed (see the example compile lines below).
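A sketch of the corresponding compile and link lines, extending those from Step 1 (the placeholders are the same as before):

  mpif90 -vtrace -tcollect -c <lots_of_options> <source_file>
  mpif90 -vtrace -tcollect -o heat_mpi.exe <lots_of_objects>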

MPI message checking

Error detection for MPI code, which also makes use of the MPI profiling interface. Supported module stack on LRZ systems:
  module unload mpi_tracing
  module unload mpi.parastation mpi.altix   # may need to unload further modules
  module load mpi.intel
  module load mpi_tracing
Completely recompile the application; static linkage is not supported. Run the executable with LD_PRELOAD set:
  mpiexec -genv LD_PRELOAD libvtmc.so \
          -n [No. of tasks] ./myprog.exe
The report is written to stderr; check the lines marked ERROR or WARNING.

Further environment variables for execution control (use -genv to propagate them):
  VT_DEADLOCK_TIMEOUT    default 60 sec
  VT_DEADLOCK_WARNING    default 300 s
  VT_CHECK_MAX_ERRORS    default 1
  VT_CHECK_MAX_REPORTS   default 0 (unlimited)

Message checking: example output for insufficient buffering

  [0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
  (... many more info lines ...)
  [0] ERROR: LOCAL:BUFFER:INSUFFICIENT_BUFFER: error
  [0] ERROR: Buffer [0x , 0x e[ of size 110 cannot store message of size 111.
  [0] ERROR: Free space [0x , 0x e[, 110 bytes.
  [0] ERROR: Check buffer handling (use larger buffer in MPI_Buffer_attach(),
  [0] ERROR: receive the oldest message(s) to free up space before buffering new ones,
  [0] ERROR: check for race conditions between buffering and receiving messages, etc).
  [0] ERROR: Note that message sizes are calculated using the worst-case scenario that
  [0] ERROR: the application has to be prepared for: MPI_Pack_size() + MPI_BSEND_OVERHEAD.
  [0] ERROR: New message of size 111 was to be sent by:
  [0] ERROR: MPI_Bsend(*buf=0x , count=16, datatype=mpi_char, dest=1, tag=100, comm=mpi_comm_world)
  [0] ERROR: testpairs (/home/cluster/a2832ba/size.c:85)
  [0] ERROR: wrongbuffer (/home/cluster/a2832ba/size.c:358)
  [0] ERROR: main (/home/cluster/a2832ba/size.c:378)
  [0] ERROR: libc_start_main (/lib/libc-2.4.so)
  [0] ERROR: _start (/home/cluster/a2832ba/size)
  [0] INFO: 1 error, limit CHECK-MAX-ERRORS reached => aborting

The checker points out the source lines and the faulty MPI call, and provides recommendations on fixing the problem.
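To illustrate the kind of mistake behind such a report, here is a small Fortran sketch (an assumption - the program in the report above is a different, C code): the attached buffer omits MPI_BSEND_OVERHEAD, so the buffered send does not fit and the checker flags LOCAL:BUFFER:INSUFFICIENT_BUFFER:

  program bsend_bug_sketch
    use mpi
    implicit none
    integer, parameter :: msglen = 16
    character(len=msglen) :: msg
    character(len=1), allocatable :: buf(:)
    integer :: ierr, rank, nprocs, packsize
    integer :: status(MPI_STATUS_SIZE)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
    call MPI_PACK_SIZE(msglen, MPI_CHARACTER, MPI_COMM_WORLD, packsize, ierr)
    allocate(buf(packsize))              ! bug: MPI_BSEND_OVERHEAD not added
    call MPI_BUFFER_ATTACH(buf, packsize, ierr)
    if (rank == 0 .and. nprocs > 1) then
       msg = 'some payload'
       call MPI_BSEND(msg, msglen, MPI_CHARACTER, 1, 100, MPI_COMM_WORLD, ierr)
    else if (rank == 1) then
       call MPI_RECV(msg, msglen, MPI_CHARACTER, 0, 100, MPI_COMM_WORLD, status, ierr)
    end if
    call MPI_FINALIZE(ierr)
  end program bsend_bug_sketch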

MPI checking: final remarks

Further possible environment settings can be deduced from <program>.prot. For deadlock detection with a large number of MPI tasks you may need to readjust the timeouts, else large messages may provoke false deadlock reports. Documentation: check the Intel Trace Collector User's Guide. On x86, Valgrind can also be integrated for memory checking.
