Profiling Parallel Performance using Vampir and Paraver


Andrew Sunderland, Andrew Porter
STFC Daresbury Laboratory, Warrington, WA4 4AD

Abstract

Two popular parallel profiling tools installed on HPCx are Vampir and Paraver, which are also widely available on other platforms. These tools can simultaneously monitor hardware counters and track message-passing calls, providing valuable information on an application's runtime behaviour which can be used to improve its performance. In this report we look at using these tools in practice on a number of different applications on HPCx, with the aim of showing users how to utilise such profilers to help them understand the behaviour of their own codes. As part of this, we also examine the use of Vampir for codes run on large numbers of processes (64 or more). Interested parties should check back here regularly for updates to this paper.

This is a Technical Report from the HPCx Consortium. Report available from HPCx UoE Ltd 2007. Neither HPCx UoE Ltd nor its members separately accept any responsibility for loss or damage arising from the use of information contained in any of their reports or in any communication about their tests or investigations.

Contents

1 Introduction
2 Background to Profilers
  2.1 VAMPIR & VAMPIRTRACE
    2.1.1 Product History
  2.2 PARAVER
3 Background to Applications
  3.1 DL_POLY 3
  3.2 NEMO
  3.3 PDSYEVR
  3.4 LU decomposition using OpenMP
4 VAMPIR Performance Analysis on HPCx
  4.1 Installation
    4.1.1 VampirTrace
    4.1.2 Vampir
  4.2 Tracing the Application Code on HPCx
    4.2.1 Automatic Instrumentation
    4.2.2 Manual Instrumentation using the VampirTrace API
    4.2.3 Running the application with tracing on HPCx
    4.2.4 Hardware Event Counter Monitoring with PAPI
  4.3 Analysing DL_POLY VampirTrace files with Vampir
    4.3.1 Vampir Summary Chart
    4.3.2 Vampir Activity Chart
    4.3.3 Global Timeline View
  4.4 Analysing parallel 3D FFT performance in DL_POLY
  4.5 Profiling the NEMO application on large process counts using Vampir on HPCx
  4.6 Identifying Load Imbalances in the Development of PDSYEVR
5 PARAVER performance analysis on HPCx
  5.1 Setting up Paraver Tracing on HPCx
  5.2 Viewing Paraver tracefiles on HPCx
  5.3 Analysing the LUS2 application using Paraver
6 Summary
7 References
Appendix A

1 Introduction

The performance of a parallel code is commonly dependent on a complex combination of factors. It is therefore important that developers of High Performance Computing applications have access to effective tools for collecting and analysing performance data. This data can be used to identify such issues as computational and communication bottlenecks, load imbalances and inefficient CPU utilisation. In this report we investigate the use of Vampir (Visualization and Analysis of MPI Resources) [1] in association with its related tracing tool VampirTrace [2], and Paraver (Parallel Program Visualization and Analysis Tool) [3]. HPCx usage is demonstrated here by applying the tools to the parallel DL_POLY 3 [4] application code, the computational core of a new symmetric parallel eigensolver PDSYEVR [5], an LU decomposition code [6] parallelised using OpenMP [7] and the NEMO ocean-modelling code [8].

It is not intended that this report should be referenced as a user guide for the tools investigated. For this there are excellent documents at the respective tools' websites that detail the huge number of features available. Rather, this report is intended to give users a quick introduction to getting started with the tools on HPCx and to demonstrate, with the aid of application examples, some of the in-depth analysis that can be enabled.

2 Background to Profilers

Both analysis tools involve similar approaches, i.e. analysis of a tracefile created at the application's runtime that contains information on the various calls and events undertaken. For tracing the application code, VampirTrace requires the application to be relinked against the VampirTrace libraries, whereas Paraver-based tracing does not require any relinking of the code, only execution via the OMPItrace tool. Both VampirTrace and OMPItrace can produce a tracefile for an OpenMP program, an MPI program, or a mixed-mode OpenMP and MPI program. Both tools require licences, and environment variable settings can be used to customise which tracing events are recorded.

2.1 VAMPIR & VAMPIRTRACE

Vampir (Visualisation and Analysis of MPI Resources) [1] is a commercial post-mortem trace visualisation tool from the Center for Information Services and High Performance Computing (ZIH) of TU Dresden [2]. The freely available VampirTrace, developed in collaboration with the KOJAK project at ZAM/FZ Jülich [9], is obtainable from the same organisation. The tool uses profiling extensions to MPI and permits analysis of the message events where data is passed between processors during execution of a parallel program. Event ordering, message lengths and times can all be analysed. The latest version (5.0) features support for OpenMP events and hardware performance counters.

The tool comes in two components: VampirTrace and Vampir. The first of these includes a library which, when linked and called from a parallel program, produces an event tracefile. Common events include the entering and leaving of function calls and the sending and receiving of MPI messages. By using keywords, application-specific information can be built into the trace using subroutine calls. Trace calls can be automatically applied to the whole run-time or manually added around time-critical program sections; the latter involves adding calls to VT_USER_START( label ) and VT_USER_END( label ) at the section of interest in the source. Automatic instrumentation requires only a re-link of the application code with the VT libraries, whilst manual instrumentation requires a re-compilation of the program. Vampir itself is then used to convert the trace information into a variety of graphical views, e.g. timeline displays showing state changes and communication, profiling statistics displaying the execution times of routines, communication statistics indicating volumes and transmission rates, and more.

2.1.1 Product History

The Vampir tool has been developed at the Center for Applied Mathematics of Research Center Jülich and the Center for High Performance Computing of the Technische Universität Dresden. Vampir has been available as a commercial product since 1996 and has been enhanced in the scope of many research and development projects. In the past it was distributed by the German Pallas GmbH, which later became part of Intel Corporation; the cooperation with Intel has since ended. Vampir has been widely used in the high performance computing community for many years. A growing number of performance monitoring environments, such as TAU [10] and KOJAK [9], can produce tracefiles that are readable by Vampir. Since the release of version 5.0, Vampir supports the new Open Trace Format (OTF), also developed by ZIH. This trace format is especially designed for massively parallel programs. Due to its X-based graphical user interface, Vampir is portable to and available for many computing platforms.

2.2 PARAVER

The Paraver performance analysis tool is developed by the European Center for Parallelism of Barcelona (CEPBA) [11] at the Technical University of Catalonia. Based on an easy-to-use Motif GUI, Paraver has been developed to respond to the need for a qualitative global perception of the application behaviour by visual inspection, followed by detailed quantitative analysis of the problems identified. Paraver provides a large amount of information useful for deciding whether and where to invest the programming effort to optimise an application.

3 Background to Applications

3.1 DL_POLY 3

DL_POLY [4] is a parallel molecular dynamics simulation package developed at STFC's Daresbury Laboratory [12]. DL_POLY 3 is the most recent version (2001) and exploits a linked-cell algorithm for domain decomposition, suitable for very large systems (up to order 1,000,000 particles) of reasonably uniform density. Computationally, the code is characterised by a series of timestep calculations involving exchanges of short-range forces between particles and long-range forces between domains using three-dimensional FFTs. The computation of these 3D FFTs [13] is a major expense during the run. Depending on the general integration flavour, a DL_POLY 3 timestep can be considered to comprise the following stages: integration part 1, particle exchange, halo reconstruction, force evaluation and integration part 2. The most communication-expensive operation is the particle exchange stage, since it involves recovery of the topology of bonded interactions for particles crossing domains. Metal interactions are evaluated using tabulated data and involve a halo exchange of data, as they depend on the local density. The test case examined here is a molecular simulation involving dipalmitoylphosphatidylcholine (DPPC) in water. This system is of interest due to its complex forcefield, containing many bonded interactions including constraints, as well as van der Waals interactions and Coulomb charges.

3.2 NEMO

NEMO (Nucleus for European Modelling of the Ocean) [8] is designed for the simulation of both regional and global ocean circulation and is developed at the Laboratoire d'Océanographie Dynamique et de Climatologie at the Institut Pierre Simon Laplace. It solves a primitive-equation model of the ocean system in three dimensions using a finite-difference scheme and contains sea-ice and passive-tracer models. Originally designed for vector machines, the most recent version uses MPI in its MPP implementation. Here we discuss how Vampir may be used to analyse a code's performance on processor counts of up to 256, using NEMO as an example.

3.3 PDSYEVR

In the 1990s, Dhillon and Parlett devised a new algorithm, Multiple Relatively Robust Representations (MRRR) [14], for computing numerically orthogonal eigenvectors of a symmetric tridiagonal matrix at O(n²) cost. Recently a ScaLAPACK [15] implementation of this algorithm, named PDSYEVR, has been developed, and it is planned that this routine will be incorporated into future releases of ScaLAPACK. Analysis of some of the subroutines from initial versions of this code with Vampir helped identify performance issues on HPCx, which were later rectified by the developers.

3.4 LU decomposition using OpenMP

LUS2 is a short Fortran program that calculates an LU decomposition of a dense matrix. Parallelisation of the LU algorithm is achieved by using OpenMP Fortran interface directives, in particular PARALLEL DO loop directives, as in the construct below which loops through the rows and columns of a matrix:

C$OMP PARALLEL DO SCHEDULE(DYNAMIC,16), PRIVATE(j)
      do i=1, ISIZE
        do j=1, ISIZE
          D(i,j) = A(i,j) + B(i,j)
        enddo
      enddo
C$OMP END PARALLEL DO

4 VAMPIR Performance Analysis on HPCx

4.1 Installation

4.1.1 VampirTrace

The source files for VampirTrace can be downloaded free of charge from the ZIH website (search for VampirTrace from the home page). In order to install a 64-bit version of VampirTrace on HPCx the following compiler options were used:

./configure AR="ar -X32_64" CC=xlc_r CXX=xlC_r F77=xlf_r FC=xlf90_r \
            MPICC=mpcc_r CFLAGS="-O2 -g -q64" CXXFLAGS="-O2 -g -q64" \
            FFLAGS="-O2 -g -q64" FCFLAGS="-O2 -g -q64"

The following configuration options were also required in order to link to IBM's Parallel Operating Environment (poe) and IBM's Message Passing Interface (MPI) library, and to access hardware event counter monitoring via the Performance Application Programming Interface (PAPI):

--with-mpi-inc-dir=/usr/lpp/ppe.poe/include
--with-mpi-lib-dir=/usr/lpp/ppe.poe/lib
--with-mpi-lib=-lmpi_r
--with-papidir=/usr/local/packages/papi/papi bit
--with-papi-lib="-lpapi64 -lpmapi"

4.1.2 Vampir

A pre-compiled binary of Vampir 5.0 for AIX is available for download from the Vampir website. NB: this download is a demonstration copy only, and a permanent Vampir 5.0 installation is at present unavailable to users on HPCx. Vampir 5.0 is a GUI-based product and it is therefore intended that users provide their own copy of Vampir 5.0, installed on their own platforms, which can then be used to view tracefiles of parallel runs from HPCx locally. However, a fully featured permanent copy of Vampir 4.3 is installed on HPCx. Users should also note that previous versions of Vampir cannot read tracefiles obtained from VampirTrace 5.0, as they are incompatible with the new OTF (Open Trace Format).

4.2 Tracing the Application Code on HPCx

In order to use the VampirTrace libraries:

a) calls to switch VampirTrace on/off are made from the source code (optional);
b) the code must be relinked to the VT libraries;
c) the code is then run (in the normal way, under poe) on HPCx.

4.2.1 Automatic Instrumentation

Automatic instrumentation is the most convenient way to instrument your application. Simply use the special VT compiler wrappers, found in the $VAMPIRTRACE_HOME/bin subdirectory, without any parameters, e.g.:

vtf90 prog1.f90 prog2.f90 -o prog

In this case the appropriate VT libraries will automatically be linked into the executable and tracing will be applied to the whole executable.

4.2.2 Manual Instrumentation using the VampirTrace API

The VT_USER_START and VT_USER_END instrumentation calls can be used to mark any user-defined sequence of statements.

Fortran:

#include "vt_user.inc"

VT_USER_START( name )
   ...
VT_USER_END( name )

C:

#include "vt_user.h"
VT_USER_START("name");
   ...
VT_USER_END("name");

A unique label should be supplied as name in order to identify the different sections traced. If a block has several exit points (as is often the case for functions), all exit points have to be instrumented with VT_USER_END. The code can then be compiled using the VT compiler wrappers (e.g. vtf90, vtcc) as described above. This approach is particularly advantageous if users wish to profile certain sections of the application code and leave other parts untraced. A selective tracing approach can also reduce the size of the resulting tracefiles considerably, which in turn can speed up loading times when it comes to analysing them in Vampir.

4.2.3 Running the application with tracing on HPCx

The code can then be run in the usual way on HPCx, using poe through a LoadLeveler script. Upon completion a series of tracefiles is produced: a numbered *.filt and *.events.z file for each process used, and a global *.def.z and *.otf file.

4.2.4 Hardware Event Counter Monitoring with PAPI

In order to direct VampirTrace to collect hardware event counter data, a $VT_METRICS environment variable must be set in the LoadLeveler job command script, specifying which counters should be monitored. A list of all counters supported by the Performance Application Programming Interface (PAPI) [16] on HPCx can be generated by running the tool papi_avail in the /usr/local/packages/papi/papi bit/share/papi/utils/ directory; a full list is included in this report in Appendix A. Many useful performance metrics are available for analysis, including floating point instruction rates, integer instruction rates, L1, L2 and L3 cache usage statistics, and processor load/store instruction rates.
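As an illustration, the fragment below sketches how the counter selection and program launch might appear in a LoadLeveler script. It is a minimal sketch only: the LoadLeveler keywords and resource values are generic placeholders, the executable name ./prog is assumed, and PAPI_FP_OPS is simply the counter used in the examples later in this report.

#!/bin/ksh
# Sketch of a LoadLeveler script for a VampirTrace-instrumented run.
# The job keywords and resource values below are illustrative assumptions.
# @ job_type         = parallel
# @ total_tasks      = 8
# @ wall_clock_limit = 00:20:00
# @ queue

# Select the PAPI counter(s) VampirTrace should record (see Appendix A).
export VT_METRICS=PAPI_FP_OPS

# Run the instrumented executable in the normal way under poe.
poe ./prog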

4.3 Analysing DL_POLY VampirTrace files with Vampir

The Vampir analyser can be invoked from the command line and the tracefile loaded through the menu options File -> Open Tracefile. The loading operation can take several minutes if the tracefiles are large.

4.3.1 Vampir Summary Chart

The first analysis window to be generated is the Summary Chart, shown below in Figure 1.

Figure 1. Vampir Summary Chart

The black bar represents the sum of the overall execution time on HPCx. This time is then broken down into three constituent parts: the Application (i.e. computation) time in green, the MPI (i.e. communication) time in red and the VT_API (i.e. tracing overhead) time in blue. These representations are maintained throughout all the different Vampir views described here. From this display users can get an overall impression of the communication/computation ratio in their application code.

4.3.2 Vampir Activity Chart

A useful way of identifying load imbalances between processors is to view the Global Activity Chart under the Global Displays menu. This view, shown in Figure 2, gives a breakdown of the Application / MPI / VT_API ratio for each process involved in the execution of the program. The display below is for an eight-processor DL_POLY run and shows that communication and computation are relatively evenly distributed across the processors, and therefore the load balancing is good.

Figure 2. Vampir Global Activity Chart

4.3.3 Global Timeline View

Figure 3. Vampir Global Timeline View

The Global Timeline gives an overall view of the application's parallel characteristics over the course of the complete tracing interval, in this case the complete runtime. The time interval is measured along the horizontal axis (in minutes here) and the processes are listed vertically. Message passing between processes is represented by the black (point-to-point) and purple (global communication operations) lines that link the process timelines. From the prevalence of purple in this graphical representation it appears that communication in DL_POLY is mainly global; however, this can be somewhat misleading, as the purple messages overlay and obscure the black lines at this rather coarse zoom level. The proliferation of red MPI operations in the central part of the timeline could also lead viewers to conclude that the code is highly communication intensive. However, the above test run uses far fewer timesteps than a production run, and approximately the first two-thirds of the global timeline represents a set-up phase that in reality would be substantially less significant.

4.4 Analysing parallel 3D FFT performance in DL_POLY

Figure 4 shows how, by zooming in (left click with the mouse) on the right-hand portion of the Global Timeline, we can obtain a more representative view of the run. This shows a series of timesteps which include phases of computation (green) separated by a series of global communications at the beginning and the end of each timestep. Here the 3D FFTs, signified by black and red areas around the middle of each timestep, can just begin to be distinguished.

Figure 4. DL_POLY Timesteps in the Global Timeline View

By now selecting Global Displays -> Counter Timeline, the hardware counters selected via the $VT_METRICS environment variable can be viewed on the same scale (Figure 5). Here we have chosen to run the code with $VT_METRICS=PAPI_FP_OPS set in the LoadLeveler script, thereby measuring floating point operations throughout the application.

Figure 5. Vampir Hardware Counter Timeline view of DL_POLY timesteps

It can be seen that the flop rate peaks at around 100 Mflop/s per processor towards the centre of a timestep and reduces to virtually zero during the intensive global communication phases at the end of the timestep. Zooming in further (Figure 6), we can identify the routine in which the flop rate is at a maximum, parallel_fft (the number after the function name in the display represents the number of times that the function has been called). The associated Counter Timeline is also shown below.

Figure 6. Parallel 3D FFT in DL_POLY Timelines

The characteristic communication pattern for a 3D FFT is shown clearly in Figure 6, i.e. pairwise point-to-point communications in firstly the x, then the y, then the z direction. Again, the corresponding counter timeline shows how the flop/s rate reduces to almost zero during communication-dominated periods, while serial performance peaks at around 100 Mflop/s during the FFT computation. A summary of the message passing statistics, highlighting the level of data transfer between processors, can also be obtained (Figure 7). This shows how each processor transfers 8 Mbytes of data with three other processors, representing pair-wise communication in the x, y and z directions.

Figure 7. Message Passing statistics for the 3D FFT

Left-clicking on any of the black point-to-point message lines in the 3D FFT timeline highlights the specified message and brings up a pop-up box with more details on this message-passing instance. Shown in Figure 8 are the details of the message highlighted at the bottom right corner of the timeline in Figure 6.

Figure 8. Individual Message Statistics
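To make the 3D FFT communication pattern described above concrete, the fragment below is a minimal Fortran sketch of the kind of pairwise exchange that produces such a signature: each process swaps a block of data with a single partner along one axis using MPI_Sendrecv, and the same step is repeated with a different partner for each of the other two axes. It is an illustration only; the routine and variable names (exchange_along_axis, partner, sendbuf, recvbuf) are invented for this sketch and do not correspond to the actual DL_POLY source.

! Sketch (not DL_POLY source) of a pairwise exchange along one axis of a
! 3D FFT data transpose: each process swaps n double-precision words with
! one partner process in a single, deadlock-free MPI_Sendrecv call.
subroutine exchange_along_axis(partner, sendbuf, recvbuf, n, comm)
  implicit none
  include 'mpif.h'
  integer, intent(in) :: partner, n, comm
  double precision, intent(in)  :: sendbuf(n)
  double precision, intent(out) :: recvbuf(n)
  integer :: ierr, status(MPI_STATUS_SIZE)

  call MPI_Sendrecv(sendbuf, n, MPI_DOUBLE_PRECISION, partner, 0, &
                    recvbuf, n, MPI_DOUBLE_PRECISION, partner, 0, &
                    comm, status, ierr)
end subroutine exchange_along_axis

Calling such a routine once each for the x, y and z partners reproduces the three pairwise transfers per process that are visible in the message statistics of Figure 7.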

4.5 Profiling the NEMO application on large process counts using Vampir on HPCx

An immediate drawback of VampirTrace when using large numbers of processes is the size of the trace files produced and, consequently, the amount of memory (and time) needed by Vampir when loading them. This may be alleviated by reducing the length of the benchmarking run itself (e.g. the number of timesteps that are requested), but ultimately it may be necessary to manually instrument the source code (as described in Section 4.2.2) so that data is only collected for the sections of the code that are of interest. For instance, the scaling performance of a code will not be affected by the performance of any start-up and initialisation routines and yet, for a small benchmarking run, these may take a significant fraction of the runtime. Below we show an example of a summary activity timeline generated by Vampir using a trace file from a manually-instrumented version of the NEMO source code. The large blue areas signify time when the code was not in any of the instrumented regions and are broken only by some initialisation and a region where tracing was (programmatically) switched on for a few timesteps midway through the run before being switched off again.

Figure 9. The activity timeline generated from a manually-instrumented version of NEMO. It contains a little initialisation and then data for a few time-steps midway through the run.
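A minimal sketch of this style of selective instrumentation is shown below. It uses only the VT_USER_START/VT_USER_END calls introduced in Section 4.2.2, guarded by a test on the timestep counter so that a named region is recorded for a small window of steps in the middle of the run. The step count, the chosen window and the dummy "work" are illustrative assumptions and not the actual NEMO code.

#include "vt_user.inc"
program trace_window
   ! Sketch only: enter a VampirTrace user region for timesteps 500-504
   ! so that detailed trace data is recorded for just that window.
   implicit none
   integer, parameter :: nsteps = 1000
   integer :: istep
   real    :: x

   x = 0.0
   do istep = 1, nsteps
      if (istep == 500) then
         VT_USER_START('traced_steps')
      end if

      x = x + sin(real(istep))     ! stands in for the real timestep work

      if (istep == 504) then
         VT_USER_END('traced_steps')
      end if
   end do
   print *, x
end program trace_window

When compiled with the vtf90 wrapper (the #include line means the file must be passed through the C preprocessor, e.g. by giving it an upper-case .F90 suffix), only the marked window appears as a user region in the resulting trace, keeping the tracefile small.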

The full trace data for the few timesteps may be loaded by selecting the relevant region from the summary timeline. Since the tracing has been programmatically switched on for a set number of timesteps, the information provided by the resulting summary may be reliably compared between different runs, since it does not depend on the area of the activity timeline selected by the user. Below we show an example of such a summary where the code has been manually instrumented.

Figure 10. Summary view of trace data for five timesteps of a manually instrumented version of NEMO running on 128 processes.

Once the trace data has been loaded, the user often wishes to view the global timeline, an example of which is shown below for a single timestep of NEMO. A useful summary of this view may be obtained by right-clicking and selecting Components -> Parallelism Display. This brings up the display visible at the bottom of the figure, from which it is easy to determine which sections of the timestep are dominated by, e.g., MPI communications (coloured red by default in Vampir). An example here is the section of NEMO dealing with ice transport processes (coloured bright green). Also of note in this example is the dominance of global communications (coloured purple) over the last 16 processes. It turns out that these processes have been allocated the region of the globe in the vicinity of the poles and thus have extra work to do in removing noise introduced by the small mesh size in this region.

Figure 11. A global timeline for each of 64 processors during a single timestep of NEMO. A 'parallelism' display is included at the bottom showing the percentage of the processors involved in each activity at any one time.

The usefulness of the global timeline can be limited when looking at tracefiles for numbers of processors greater than 64, as Vampir will try to scale the data for each process so as to fit them all on screen. However, one can specify the number of process timelines to be displayed at a time by right-clicking on the display and selecting Options -> Show Subset... This brings up the Show Subset Dialog:

Figure 12. The Show Subset Dialog for the global timeline. Use this to choose the number of processors ('Bars') for which data is shown on the timeline.

Using this dialog one can look at the application's behaviour in detail on a few processes or look at the overall behaviour on many processes. The figure below shows Vampir displaying the activity of the majority of 128 processes during the section of the code dealing with ice rheology in NEMO. The effect of the LPARs on HPCx (effectively 16-way SMP nodes) on inter-process communications is highlighted by the fact that groups of 16 well-synchronised processes may be identified.

Figure 13. The global timeline configured to show data for the majority of the 128 processors of the job.

4.6 Identifying Load Imbalances in the Development of PDSYEVR

Profiling early versions of the new ScaLAPACK routine PDSYEVR with VampirTrace (VT) allows us to investigate its performance in detail. Basic timing analysis of the code revealed that load-balancing problems may exist for certain datasets in the eigenvector calculation stage of the underlying tridiagonal eigensolver MRRR. The Vampir analyses shown below enabled us to track this potential inefficiency with great precision.

In order to track the code in more detail, different colours were assigned to different functions here, using the syntax described in $VT_HOME/info/GROUPS.SPEC. Some additions to the underlying source code are required and a re-compilation must be undertaken. In the timeline view shown in Figure 14, the cyan areas represent computation in the subroutine DLARRV. This routine is involved in the calculation of eigenvectors. As usual, time spent in communication is represented by the red areas in the timeline, and the purple lines represent individual messages passed between processors.

Figure 14. Vampir Timeline for original DLARRV subroutine

The above timeline trace shows that, when calculating half the subset of eigenvalues, the workload in DLARRV increases substantially from process 0 to process 14. This imbalance causes a large communication overhead, represented by the large red areas in the trace. Following this, it was determined that the load imbalance was primarily caused by an unequal division of eigenvectors amongst the processes. These problems were addressed by the ScaLAPACK developers, and a newer version of the code gave a much better division of workload, as can be seen in the timeline traces in Figure 15.

Figure 15. Vampir Timeline for modified DLARRV subroutine

5 PARAVER performance analysis on HPCx

5.1 Setting up Paraver Tracing on HPCx

Paraver uses the tool OMPItrace to generate tracefiles for OpenMP programs, MPI programs, or mixed-mode OpenMP and MPI programs. Users should note that OMPItrace currently only works with 32-bit executables on HPCx, and also that OMPItrace uses IBM's DPCL (Dynamic Probe Class Library), which requires a .rhosts file in your home directory listing all the processor ids on HPCx. Paraver tracefiles are generated on HPCx by adding the environment variables (in e.g. ksh/bash):

export OMPITRACE_HOME=/usr/local/packages/paraver
export MPTRACE_COUNTGROUP=60

to the LoadLeveler job control script. The poe command in the LoadLeveler script is then changed from, e.g.:

poe ./prog

to:

$OMPITRACE_HOME/bin/ompitrace -counters -v poe.real ./prog
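Putting these pieces together, a LoadLeveler script for a Paraver-traced run might look something like the sketch below. The LoadLeveler keywords, resource values and the executable name ./prog are illustrative assumptions; only the two exports and the ompitrace invocation follow the settings described in this section.

#!/bin/ksh
# Sketch of a LoadLeveler script for an OMPItrace/Paraver run.
# Job keywords and resource values are illustrative assumptions.
# @ job_type         = parallel
# @ total_tasks      = 8
# @ wall_clock_limit = 00:20:00
# @ queue

export OMPITRACE_HOME=/usr/local/packages/paraver
export MPTRACE_COUNTGROUP=60

# Call poe.real directly through the ompitrace wrapper (see note below).
$OMPITRACE_HOME/bin/ompitrace -counters -v poe.real ./prog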

On HPCx, poe is in fact a wrapper around the real poe command; in order for OMPItrace to function correctly on HPCx, poe.real must be called directly.

5.2 Viewing Paraver tracefiles on HPCx

The following environment variables should be set in the user's login session:

export PARAVER_HOME=/usr/local/packages/paraver
export MPTRACE_COUNTGROUP=60

During the run, Paraver will have created a temporary trace file for each process (*.mpit and *.sim files). After the run has completed, the user must pack these individual profile files into one global output. This is done by issuing the command:

$PARAVER_HOME/bin/ompi2prv *.mpit -s *.sym -o trace_prm.prv

To view the resulting tracefile use the command:

$PARAVER_HOME/bin/paraver trace_prm.prv

5.3 Analysing the LUS2 application using Paraver

Unlike Vampir, upon starting Paraver users are immediately shown the Global Timeline view. The parallelisation of LUS2 is based on OpenMP, so threads rather than processes are listed on the vertical axis, against time on the horizontal axis. Zooming in on a representative section of the trace shows the following (Figure 16):

Figure 16. Paraver Timeline for two cycles of $OMP PARALLEL DO

The default colours assigned represent the following activities:

Figure 17. Colour properties in Paraver

The trace in Figure 16 shows a typical slice of the timeline from LUS2, where the code is executing the $OMP PARALLEL DO construct across the matrix, as described in Section 3.4. It can be seen that relatively large swathes of blue, representing computation, are divided by thread administration tasks at the start and end of each $OMP PARALLEL DO cycle.

Figure 18. Detailed view of OMP thread scheduling in LUS2

In Figure 18, above the timeline bar of each thread is a series of green flags, each denoting a change of state in the thread. Clicking on a flag gives a detailed description, as shown in the example above. Here it can be seen that thread 16 first undergoes a global synchronisation before being scheduled to run the next cycle of the loop.

6 Summary

Profilers can be highly effective tools in the analysis of parallel programs on HPC architectures. They are particularly useful for identifying and measuring the effect of such problems as communication bottlenecks and load imbalances on the efficiency of codes. New versions of these tools can also record hardware performance data, which facilitates detailed analysis of serial processor performance within a parallel run. The Vampir and Paraver GUI-based analysis tools allow users to switch with ease from global analyses of the parallel run to very detailed analyses of specific messages, all within the one profiling session. Interoperability of VampirTrace with other profilers such as KOJAK and TAU has now been made possible by the adoption of the Open Trace Format.

Acknowledgements

The authors would like to thank Matthias Jurenz from TU Dresden, Chris Johnson from EPCC, University of Edinburgh, and Ilian Todorov and Ian Bush from STFC Daresbury Laboratory for their help in creating this report.

7 References

[1] Vampir - Performance Optimization.
[2] VampirTrace, ZIH, Technische Universität Dresden.
[3] Paraver, The European Center for Parallelism of Barcelona.
[4] The DL_POLY Simulation Package, W. Smith, STFC Daresbury Laboratory.

[5] PDSYEVR. ScaLAPACK's parallel MRRR algorithm for the symmetric eigenvalue problem, D. Antonelli and C. Vömel, LAPACK Working Note 168 (2005).
[6] OMPItrace Tool User's Guide.
[7] The OpenMP Application Program Interface.
[8] NEMO - Nucleus for European Modelling of the Ocean.
[9] KOJAK - Automatic Performance Analysis Toolset, Forschungszentrum Jülich.
[10] TAU - Tuning and Analysis Utilities, University of Oregon.
[11] The European Center for Parallelism of Barcelona.
[12] Science & Technology Facilities Council.
[13] A Parallel Implementation of SPME for DL_POLY 3, I. J. Bush and W. Smith, STFC Daresbury Laboratory.
[14] A Parallel Eigensolver for Dense Symmetric Matrices based on Multiple Relatively Robust Representations, P. Bientinesi, I. S. Dhillon and R. A. van de Geijn, UT CS Technical Report #TR-03026 (2003).
[15] ScaLAPACK.
[16] PAPI - Performance Application Programming Interface.

Appendix A

The list of available PAPI hardware counters on HPCx.

Test case avail.c: Available events and hardware information

Vendor string and code    : IBM (-1)

Model string and code     : POWER5 (8192)
CPU Revision              :
CPU Megahertz             :
CPU's in this Node        : 16
Nodes in this System      : 1
Total CPU's               : 16
Number Hardware Counters  : 6
Max Multiplex Counters    :

Name            Avail  Deriv  Description
PAPI_L1_DCM     Yes    Yes    Level 1 data cache misses
PAPI_L1_ICM     No     No     Level 1 instruction cache misses
PAPI_L2_DCM     Yes    No     Level 2 data cache misses
PAPI_L2_ICM     Yes    No     Level 2 instruction cache misses
PAPI_L3_DCM     Yes    Yes    Level 3 data cache misses
PAPI_L3_ICM     Yes    Yes    Level 3 instruction cache misses
PAPI_L1_TCM     No     No     Level 1 cache misses
PAPI_L2_TCM     No     No     Level 2 cache misses
PAPI_L3_TCM     No     No     Level 3 cache misses
PAPI_CA_SNP     No     No     Requests for a snoop
PAPI_CA_SHR     No     No     Requests for exclusive access to shared cache line
PAPI_CA_CLN     No     No     Requests for exclusive access to clean cache line
PAPI_CA_INV     No     No     Requests for cache line invalidation
PAPI_CA_ITV     No     No     Requests for cache line intervention
PAPI_L3_LDM     Yes    Yes    Level 3 load misses
PAPI_L3_STM     No     No     Level 3 store misses
PAPI_BRU_IDL    No     No     Cycles branch units are idle
PAPI_FXU_IDL    Yes    No     Cycles integer units are idle
PAPI_FPU_IDL    No     No     Cycles floating point units are idle
PAPI_LSU_IDL    No     No     Cycles load/store units are idle
PAPI_TLB_DM     Yes    No     Data translation lookaside buffer misses
PAPI_TLB_IM     Yes    No     Instruction translation lookaside buffer misses
PAPI_TLB_TL     Yes    Yes    Total translation lookaside buffer misses
PAPI_L1_LDM     Yes    No     Level 1 load misses
PAPI_L1_STM     Yes    No     Level 1 store misses
PAPI_L2_LDM     Yes    No     Level 2 load misses
PAPI_L2_STM     No     No     Level 2 store misses
PAPI_BTAC_M     No     No     Branch target address cache misses
PAPI_PRF_DM     No     No     Data prefetch cache misses
PAPI_L3_DCH     No     No     Level 3 data cache hits
PAPI_TLB_SD     No     No     Translation lookaside buffer shootdowns
PAPI_CSR_FAL    No     No     Failed store conditional instructions
PAPI_CSR_SUC    No     No     Successful store conditional instructions
PAPI_CSR_TOT    No     No     Total store conditional instructions
PAPI_MEM_SCY    No     No     Cycles Stalled Waiting for memory accesses
PAPI_MEM_RCY    No     No     Cycles Stalled Waiting for memory Reads
PAPI_MEM_WCY    No     No     Cycles Stalled Waiting for memory writes
PAPI_STL_ICY    Yes    No     Cycles with no instruction issue
PAPI_FUL_ICY    No     No     Cycles with maximum instruction issue
PAPI_STL_CCY    No     No     Cycles with no instructions completed
PAPI_FUL_CCY    No     No     Cycles with maximum instructions completed
PAPI_HW_INT     Yes    No     Hardware interrupts
PAPI_BR_UCN     No     No     Unconditional branch instructions
PAPI_BR_CN      No     No     Conditional branch instructions
PAPI_BR_TKN     No     No     Conditional branch instructions taken
PAPI_BR_NTK     No     No     Conditional branch instructions not taken
PAPI_BR_MSP     Yes    Yes    Conditional branch instructions mispredicted
PAPI_BR_PRC     No     No     Conditional branch instructions correctly predicted
PAPI_FMA_INS    Yes    No     FMA instructions completed
PAPI_TOT_IIS    Yes    No     Instructions issued
PAPI_TOT_INS    Yes    No     Instructions completed
PAPI_INT_INS    Yes    No     Integer instructions
PAPI_FP_INS     Yes    No     Floating point instructions
PAPI_LD_INS     Yes    No     Load instructions
PAPI_SR_INS     Yes    No     Store instructions
PAPI_BR_INS     Yes    No     Branch instructions
PAPI_VEC_INS    No     No     Vector/SIMD instructions
PAPI_RES_STL    No     No     Cycles stalled on any resource
PAPI_FP_STAL    No     No     Cycles the FP unit(s) are stalled
PAPI_TOT_CYC    Yes    No     Total cycles
PAPI_LST_INS    Yes    Yes    Load/store instructions completed
PAPI_SYC_INS    No     No     Synchronization instructions completed
PAPI_L1_DCH     No     No     Level 1 data cache hits
PAPI_L2_DCH     No     No     Level 2 data cache hits
PAPI_L1_DCA     Yes    Yes    Level 1 data cache accesses
PAPI_L2_DCA     No     No     Level 2 data cache accesses
PAPI_L3_DCA     No     No     Level 3 data cache accesses
PAPI_L1_DCR     Yes    No     Level 1 data cache reads
PAPI_L2_DCR     No     No     Level 2 data cache reads
PAPI_L3_DCR     Yes    No     Level 3 data cache reads
PAPI_L1_DCW     Yes    No     Level 1 data cache writes
PAPI_L2_DCW     No     No     Level 2 data cache writes
PAPI_L3_DCW     No     No     Level 3 data cache writes
PAPI_L1_ICH     Yes    No     Level 1 instruction cache hits
PAPI_L2_ICH     No     No     Level 2 instruction cache hits
PAPI_L3_ICH     No     No     Level 3 instruction cache hits
PAPI_L1_ICA     No     No     Level 1 instruction cache accesses
PAPI_L2_ICA     No     No     Level 2 instruction cache accesses
PAPI_L3_ICA     Yes    No     Level 3 instruction cache accesses
PAPI_L1_ICR     No     No     Level 1 instruction cache reads
PAPI_L2_ICR     No     No     Level 2 instruction cache reads
PAPI_L3_ICR     No     No     Level 3 instruction cache reads
PAPI_L1_ICW     No     No     Level 1 instruction cache writes
PAPI_L2_ICW     No     No     Level 2 instruction cache writes
PAPI_L3_ICW     No     No     Level 3 instruction cache writes
PAPI_L1_TCH     No     No     Level 1 total cache hits
PAPI_L2_TCH     No     No     Level 2 total cache hits
PAPI_L3_TCH     No     No     Level 3 total cache hits
PAPI_L1_TCA     No     No     Level 1 total cache accesses
PAPI_L2_TCA     No     No     Level 2 total cache accesses
PAPI_L3_TCA     No     No     Level 3 total cache accesses
PAPI_L1_TCR     No     No     Level 1 total cache reads
PAPI_L2_TCR     No     No     Level 2 total cache reads
PAPI_L3_TCR     No     No     Level 3 total cache reads
PAPI_L1_TCW     No     No     Level 1 total cache writes
PAPI_L2_TCW     No     No     Level 2 total cache writes
PAPI_L3_TCW     No     No     Level 3 total cache writes
PAPI_FML_INS    No     No     Floating point multiply instructions
PAPI_FAD_INS    No     No     Floating point add instructions
PAPI_FDV_INS    Yes    No     Floating point divide instructions
PAPI_FSQ_INS    Yes    No     Floating point square root instructions
PAPI_FNV_INS    No     No     Floating point inverse instructions
PAPI_FP_OPS     Yes    Yes    Floating point operations

avail.c PASSED


More information

Prof. Thomas Sterling

Prof. Thomas Sterling High Performance Computing: Concepts, Methods & Means Performance Measurement 1 Prof. Thomas Sterling Department of Computer Science Louisiana i State t University it February 13 th, 2007 News Alert! Intel

More information

Profiling: Understand Your Application

Profiling: Understand Your Application Profiling: Understand Your Application Michal Merta michal.merta@vsb.cz 1st of March 2018 Agenda Hardware events based sampling Some fundamental bottlenecks Overview of profiling tools perf tools Intel

More information

Performance Analysis with Vampir

Performance Analysis with Vampir Performance Analysis with Vampir Ronald Geisler, Holger Brunst, Bert Wesarg, Matthias Weber, Hartmut Mix, Ronny Tschüter, Robert Dietrich, and Andreas Knüpfer Technische Universität Dresden Outline Part

More information

Evaluation of Profiling Tools for the Acquisition of Time Independent Traces

Evaluation of Profiling Tools for the Acquisition of Time Independent Traces Evaluation of Profiling Tools for the Acquisition of Time Independent Traces Frédéric Desprez, George S. Markomanolis, Frédéric Suter TECHNICAL REPORT N 437 July 2013 Project-Team AVALON ISSN 0249-0803

More information

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication

More information

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions Data In single-program multiple-data (SPMD) parallel programs, global data is partitioned, with a portion of the data assigned to each processing node. Issues relevant to choosing a partitioning strategy

More information

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008 SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem

More information

MPI Performance Tools

MPI Performance Tools Physics 244 31 May 2012 Outline 1 Introduction 2 Timing functions: MPI Wtime,etime,gettimeofday 3 Profiling tools time: gprof,tau hardware counters: PAPI,PerfSuite,TAU MPI communication: IPM,TAU 4 MPI

More information

LAPACK. Linear Algebra PACKage. Janice Giudice David Knezevic 1

LAPACK. Linear Algebra PACKage. Janice Giudice David Knezevic 1 LAPACK Linear Algebra PACKage 1 Janice Giudice David Knezevic 1 Motivating Question Recalling from last week... Level 1 BLAS: vectors ops Level 2 BLAS: matrix-vectors ops 2 2 O( n ) flops on O( n ) data

More information

VAMPIR & VAMPIRTRACE Hands On

VAMPIR & VAMPIRTRACE Hands On VAMPIR & VAMPIRTRACE Hands On PRACE Spring School 2012 in Krakow May, 2012 Holger Brunst Slides by: Andreas Knüpfer, Jens Doleschal, ZIH, Technische Universität Dresden Hands-on: NPB Build Copy NPB sources

More information

Batch Jobs Performance Testing

Batch Jobs Performance Testing Batch Jobs Performance Testing October 20, 2012 Author Rajesh Kurapati Introduction Batch Job A batch job is a scheduled program that runs without user intervention. Corporations use batch jobs to automate

More information

Analyzing I/O Performance on a NEXTGenIO Class System

Analyzing I/O Performance on a NEXTGenIO Class System Analyzing I/O Performance on a NEXTGenIO Class System holger.brunst@tu-dresden.de ZIH, Technische Universität Dresden LUG17, Indiana University, June 2 nd 2017 NEXTGenIO Fact Sheet Project Research & Innovation

More information

Vampir 9 User Manual

Vampir 9 User Manual Vampir 9 User Manual Copyright c 2018 GWT-TUD GmbH Freiberger Str. 33 01067 Dresden, Germany http://gwtonline.de Support / Feedback / Bug Reports Please provide us feedback! We are very interested to hear

More information

( ZIH ) Center for Information Services and High Performance Computing. Event Tracing and Visualization for Cell Broadband Engine Systems

( ZIH ) Center for Information Services and High Performance Computing. Event Tracing and Visualization for Cell Broadband Engine Systems ( ZIH ) Center for Information Services and High Performance Computing Event Tracing and Visualization for Cell Broadband Engine Systems ( daniel.hackenberg@zih.tu-dresden.de ) Daniel Hackenberg Cell Broadband

More information

Overview. Timers. Profilers. HPM Toolkit

Overview. Timers. Profilers. HPM Toolkit Overview Timers Profilers HPM Toolkit 2 Timers Wide range of timers available on the HPCx system Varying precision portability language ease of use 3 Timers Timer Usage Wallclock/C PU Resolution Language

More information

Using VTK and the OpenGL Graphics Libraries on HPCx

Using VTK and the OpenGL Graphics Libraries on HPCx Using VTK and the OpenGL Graphics Libraries on HPCx Jeremy Nowell EPCC The University of Edinburgh Edinburgh EH9 3JZ Scotland, UK April 29, 2005 Abstract Some of the graphics libraries and visualisation

More information

Performance Analysis of the MPAS-Ocean Code using HPCToolkit and MIAMI

Performance Analysis of the MPAS-Ocean Code using HPCToolkit and MIAMI Performance Analysis of the MPAS-Ocean Code using HPCToolkit and MIAMI Gabriel Marin February 11, 2014 MPAS-Ocean [4] is a component of the MPAS framework of climate models. MPAS-Ocean is an unstructured-mesh

More information

Hybrid Programming with MPI and SMPSs

Hybrid Programming with MPI and SMPSs Hybrid Programming with MPI and SMPSs Apostolou Evangelos August 24, 2012 MSc in High Performance Computing The University of Edinburgh Year of Presentation: 2012 Abstract Multicore processors prevail

More information

The PAPI Cross-Platform Interface to Hardware Performance Counters

The PAPI Cross-Platform Interface to Hardware Performance Counters The PAPI Cross-Platform Interface to Hardware Performance Counters Kevin London, Shirley Moore, Philip Mucci, and Keith Seymour University of Tennessee-Knoxville {london, shirley, mucci, seymour}@cs.utk.edu

More information

Performance Analysis for Large Scale Simulation Codes with Periscope

Performance Analysis for Large Scale Simulation Codes with Periscope Performance Analysis for Large Scale Simulation Codes with Periscope M. Gerndt, Y. Oleynik, C. Pospiech, D. Gudu Technische Universität München IBM Deutschland GmbH May 2011 Outline Motivation Periscope

More information

PCAN-Explorer 6. Tel: Professional Windows Software to Communicate with CAN and CAN FD Busses. Software >> PC Software

PCAN-Explorer 6. Tel: Professional Windows Software to Communicate with CAN and CAN FD Busses. Software >> PC Software PCAN-Explorer 6 Professional Windows Software to Communicate with CAN and CAN FD Busses The PCAN-Explorer 6 is a versatile, professional program for working with CAN and CAN FD networks. The user is not

More information

!OMP #pragma opm _OPENMP

!OMP #pragma opm _OPENMP Advanced OpenMP Lecture 12: Tips, tricks and gotchas Directives Mistyping the sentinel (e.g.!omp or #pragma opm ) typically raises no error message. Be careful! The macro _OPENMP is defined if code is

More information

NEXTGenIO Performance Tools for In-Memory I/O

NEXTGenIO Performance Tools for In-Memory I/O NEXTGenIO Performance Tools for In- I/O holger.brunst@tu-dresden.de ZIH, Technische Universität Dresden 22 nd -23 rd March 2017 Credits Intro slides by Adrian Jackson (EPCC) A new hierarchy New non-volatile

More information

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,

More information

Performance Analysis with Vampir. Joseph Schuchart ZIH, Technische Universität Dresden

Performance Analysis with Vampir. Joseph Schuchart ZIH, Technische Universität Dresden Performance Analysis with Vampir Joseph Schuchart ZIH, Technische Universität Dresden 1 Mission Visualization of dynamics of complex parallel processes Full details for arbitrary temporal and spatial levels

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,

More information

Dynamic Selection of Auto-tuned Kernels to the Numerical Libraries in the DOE ACTS Collection

Dynamic Selection of Auto-tuned Kernels to the Numerical Libraries in the DOE ACTS Collection Numerical Libraries in the DOE ACTS Collection The DOE ACTS Collection SIAM Parallel Processing for Scientific Computing, Savannah, Georgia Feb 15, 2012 Tony Drummond Computational Research Division Lawrence

More information

Using Java for Scientific Computing. Mark Bul EPCC, University of Edinburgh

Using Java for Scientific Computing. Mark Bul EPCC, University of Edinburgh Using Java for Scientific Computing Mark Bul EPCC, University of Edinburgh markb@epcc.ed.ac.uk Java and Scientific Computing? Benefits of Java for Scientific Computing Portability Network centricity Software

More information

Parallel Performance and Optimization

Parallel Performance and Optimization Parallel Performance and Optimization Gregory G. Howes Department of Physics and Astronomy University of Iowa Iowa High Performance Computing Summer School University of Iowa Iowa City, Iowa 25-26 August

More information

Parallel Performance Analysis Using the Paraver Toolkit

Parallel Performance Analysis Using the Paraver Toolkit Parallel Performance Analysis Using the Paraver Toolkit Parallel Performance Analysis Using the Paraver Toolkit [16a] [16a] Slide 1 University of Stuttgart High-Performance Computing Center Stuttgart (HLRS)

More information

Integrating Parallel Application Development with Performance Analysis in Periscope

Integrating Parallel Application Development with Performance Analysis in Periscope Technische Universität München Integrating Parallel Application Development with Performance Analysis in Periscope V. Petkov, M. Gerndt Technische Universität München 19 April 2010 Atlanta, GA, USA Motivation

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

VAMPIR & VAMPIRTRACE Hands On

VAMPIR & VAMPIRTRACE Hands On VAMPIR & VAMPIRTRACE Hands On 8th VI-HPS Tuning Workshop at RWTH Aachen September, 2011 Tobias Hilbrich and Joachim Protze Slides by: Andreas Knüpfer, Jens Doleschal, ZIH, Technische Universität Dresden

More information

Mixed Mode MPI / OpenMP Programming

Mixed Mode MPI / OpenMP Programming Mixed Mode MPI / OpenMP Programming L.A. Smith Edinburgh Parallel Computing Centre, Edinburgh, EH9 3JZ 1 Introduction Shared memory architectures are gradually becoming more prominent in the HPC market,

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

Improving Applica/on Performance Using the TAU Performance System

Improving Applica/on Performance Using the TAU Performance System Improving Applica/on Performance Using the TAU Performance System Sameer Shende, John C. Linford {sameer, jlinford}@paratools.com ParaTools, Inc and University of Oregon. April 4-5, 2013, CG1, NCAR, UCAR

More information

ISC 09 Poster Abstract : I/O Performance Analysis for the Petascale Simulation Code FLASH

ISC 09 Poster Abstract : I/O Performance Analysis for the Petascale Simulation Code FLASH ISC 09 Poster Abstract : I/O Performance Analysis for the Petascale Simulation Code FLASH Heike Jagode, Shirley Moore, Dan Terpstra, Jack Dongarra The University of Tennessee, USA [jagode shirley terpstra

More information

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery

More information

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured

More information