Profiling Parallel Performance using Vampir and Paraver


Andrew Sunderland, Andrew Porter
STFC Daresbury Laboratory, Warrington, WA4 4AD

Abstract

Two popular parallel profiling tools installed on HPCx are Vampir and Paraver, which are also widely available on other platforms. These tools can simultaneously monitor hardware counters and track message-passing calls, providing valuable information on an application's runtime behaviour which can be used to improve its performance. In this report we look at using these tools in practice on a number of different applications on HPCx, with the aim of showing users how to utilise such profilers to help them understand the behaviour of their own codes. As part of this, we also examine the use of Vampir for codes run on large numbers of processes (64 or more). Interested parties should check back here regularly for updates to this paper.

This is a Technical Report from the HPCx Consortium. Report available from HPCx UoE Ltd 2007. Neither HPCx UoE Ltd nor its members separately accept any responsibility for loss or damage arising from the use of information contained in any of their reports or in any communication about their tests or investigations.

Contents

1 Introduction
2 Background to Profilers
  2.1 VAMPIR & VAMPIRTRACE
    2.1.1 Product History
  2.2 PARAVER
3 Background to Applications
  3.1 DL_POLY 3
  3.2 NEMO
  3.3 PDSYEVR
  3.4 LU decomposition using OpenMP
4 VAMPIR Performance Analysis on HPCx
  4.1 Installation
    4.1.1 VampirTrace
    4.1.2 Vampir
  4.2 Tracing the Application Code on HPCx
    4.2.1 Automatic Instrumentation
    4.2.2 Manual Instrumentation using the VampirTrace API
    4.2.3 Running the application with tracing on HPCx
    4.2.4 Hardware Event Counter Monitoring with PAPI
  4.3 Analysing DL_POLY VampirTrace files with Vampir
    4.3.1 Vampir Summary Chart
    4.3.2 Vampir Activity Chart
    4.3.3 Global Timeline View
  4.4 Analysing parallel 3D FFT performance in DL_POLY
  4.5 Profiling the NEMO application on large process counts using Vampir on HPCx
  4.6 Identifying Load Imbalances in the Development of PDSYEVR
5 PARAVER performance analysis on HPCx
  5.1 Setting up Paraver Tracing on HPCx
  5.2 Viewing Paraver tracefiles on HPCx
  5.3 Analysing the LUS2 application using Paraver
6 Summary
7 References
Appendix A

1 Introduction

The performance of a parallel code is commonly dependent on a complex combination of factors. It is therefore important that developers of High Performance Computing applications have access to effective tools for collecting and analysing performance data. This data can be used to identify such issues as computational and communication bottlenecks, load imbalances and inefficient CPU utilisation. In this report we investigate the use of Vampir (Visualization and Analysis of MPI Resources) [1] in association with its related tracing tool VampirTrace [2], and Paraver (Parallel Program Visualization and Analysis Tool) [3]. HPCx usage is demonstrated here by applying the tools to the parallel DL_POLY 3 [4] application code, the computational core of a new symmetric parallel eigensolver PDSYEVR [5], an LU decomposition code [6] parallelised using OpenMP [7] and the NEMO ocean-modelling code [8].

It is not intended that this report should be referenced as a user guide for the tools investigated. For this there are excellent documents at the respective tools' websites that detail the huge number of features available. Rather, this report is intended to give users a quick introduction to getting started with the tools on HPCx and to demonstrate, with the aid of application examples, some of the in-depth analysis that can be enabled.

2 Background to Profilers

Both analysis tools involve similar approaches, i.e. analysis of a tracefile created at the application's runtime that contains information on the various calls and events undertaken. For tracing the application code, VampirTrace requires the application to be relinked against the VampirTrace libraries, whereas Paraver-based tracing does not require any relinking of the code, only execution via the OMPItrace tool. Both VampirTrace and OMPItrace can produce a tracefile for an OpenMP program, an MPI program, or a mixed-mode OpenMP and MPI program. Both tools require licences, and environment variable settings can be used to customise which tracing events are recorded.

2.1 VAMPIR & VAMPIRTRACE

Vampir (Visualisation and Analysis of MPI Resources) [1] is a commercial post-mortem trace visualisation tool from the Center for Information Services and High Performance Computing (ZIH) of TU Dresden [2]. The freely available VampirTrace, developed in collaboration with the KOJAK project at ZAM/FZ Jülich [9], is obtainable from the same organisation. The tool uses profiling extensions to MPI and permits analysis of the message events where data is passed between processors during execution of a parallel program. Event ordering, message lengths and times can all be analysed. The latest version (5.0) features support for OpenMP events and hardware performance counters.

The tool comes in two components: VampirTrace and Vampir. The first of these includes a library which, when linked and called from a parallel program, produces an event tracefile. Common events include the entering and leaving of function calls and the sending and receiving of MPI messages. By using keywords, application-specific information can be built into the trace using subroutine calls. Trace calls can be automatically applied to the whole run-time or manually added around time-critical program sections; the latter involves adding calls to VT_USER_START( label ) and VT_USER_END( label ) at the section of interest in the source. Automatic instrumentation requires only a re-link of the application code with the VT libraries, whilst manual instrumentation requires a re-compilation of the program. Vampir itself is then used to convert the trace information into a variety of graphical views, e.g. timeline displays showing state changes and communication, profiling statistics displaying the execution times of routines, communication statistics indicating volumes and transmission rates, and more.

2.1.1 Product History

The Vampir tool has been developed at the Center for Applied Mathematics of Research Center Jülich and the Center for High Performance Computing of the Technische Universität Dresden. Vampir has been available as a commercial product since 1996 and has been enhanced in the scope of many research and development projects. In the past it was distributed by the German Pallas GmbH, which later became part of Intel Corporation; the cooperation with Intel has since ended. Vampir has been widely used in the high performance computing community for many years. A growing number of performance monitoring environments, such as TAU [10] and KOJAK [9], can produce tracefiles that are readable by Vampir. Since the release of version 5.0, Vampir supports the new Open Trace Format (OTF), also developed by ZIH. This trace format is especially designed for massively parallel programs. Due to its X-based graphical user interface, Vampir is portable to and available for many computing platforms.

2.2 PARAVER

The Paraver performance analysis tool is developed by the European Center for Parallelism of Barcelona (CEPBA) [11] at the Technical University of Catalonia. Based on an easy-to-use Motif GUI, Paraver has been developed to respond to the need for a qualitative global perception of the application behaviour by visual inspection, followed by detailed quantitative analysis of the problems identified. Paraver provides a large amount of information useful for deciding whether and where to invest the programming effort to optimise an application.

3 Background to Applications

3.1 DL_POLY 3

DL_POLY [4] is a parallel molecular dynamics simulation package developed at STFC's Daresbury Laboratory [12]. DL_POLY 3 is the most recent version (2001) and exploits a linked-cell algorithm for domain decomposition, suitable for very large systems (up to order 1,000,000 particles) of reasonably uniform density. Computationally, the code is characterised by a series of timestep calculations involving exchanges of short-range forces between particles and long-range forces between domains using three-dimensional FFTs. The computation of these 3D FFTs [13] is a major expense during the run. Depending on the general integration flavour, a DL_POLY 3 timestep can be considered to comprise the following stages: integration part 1, particle exchange, halo reconstruction, force evaluation and integration part 2. The most communication-expensive operation is the particle exchange stage, since it involves recovery of the topology of bonded interactions for particles crossing domains. Metal interactions are evaluated using tabulated data and involve a halo exchange of data, as they depend on the local density. The test case examined here is a molecular simulation involving dipalmitoylphosphatidylcholine (DPPC) in water. This system is of interest due to its complex forcefield, containing many bonded interactions including constraints, as well as van der Waals interactions and Coulomb charges.

3.2 NEMO

NEMO (Nucleus for European Modelling of the Ocean) [8] is designed for the simulation of both regional and global ocean circulation and is developed at the Laboratoire d'Océanographie Dynamique et de Climatologie at the Institut Pierre Simon Laplace. It solves a primitive-equation model of the ocean system in three dimensions using a finite-difference scheme and contains sea-ice and passive-tracer models. Originally designed for vector machines, the most recent version uses MPI in its MPP implementation. Here we discuss how Vampir may be used to analyse a code's performance on processor counts of up to 256, using NEMO as an example.

3.3 PDSYEVR

In the 1990s, Dhillon and Parlett devised a new algorithm, Multiple Relatively Robust Representations (MRRR) [14], for computing numerically orthogonal eigenvectors of a symmetric tridiagonal matrix at O(n²) cost. Recently a ScaLAPACK [15] implementation of this algorithm, named PDSYEVR, has been developed, and it is planned that this routine will be incorporated into future releases of ScaLAPACK. Analysis of some of the subroutines from initial versions of this code with Vampir helped identify performance issues on HPCx, which were later rectified by the developers.

3.4 LU decomposition using OpenMP

LUS2 is a short Fortran program that calculates an LU decomposition of a dense matrix. Parallelisation of the LU algorithm is achieved by using OpenMP Fortran interface directives, in particular PARALLEL DO loop directives, as in the construct below which loops through the rows and columns of a matrix:

C$OMP PARALLEL DO SCHEDULE(DYNAMIC,16), PRIVATE(j)
      do i=1, ISIZE
        do j=1, ISIZE
          D(i,j) = A(i,j) + B(i,j)
        enddo
      enddo
C$OMP END PARALLEL DO

4 VAMPIR Performance Analysis on HPCx

4.1 Installation

4.1.1 VampirTrace

The source files for VampirTrace can be downloaded free of charge from the ZIH website (search for VampirTrace from the home page). In order to install a 64-bit version of VampirTrace on HPCx the following compiler options were used:

./configure AR="ar -X32_64" CC=xlc_r CXX=xlC_r F77=xlf_r FC=xlf90_r \
            MPICC=mpcc_r CFLAGS="-O2 -g -q64" CXXFLAGS="-O2 -g -q64" \
            FFLAGS="-O2 -g -q64" FCFLAGS="-O2 -g -q64"

The following configuration options were also required in order to link to IBM's Parallel Operating Environment (poe) and IBM's Message Passing Interface (MPI) library, and to access hardware event counter monitoring via the Performance Application Programming Interface (PAPI):

--with-mpi-inc-dir=/usr/lpp/ppe.poe/include
--with-mpi-lib-dir=/usr/lpp/ppe.poe/lib
--with-mpi-lib=-lmpi_r
--with-papidir=/usr/local/packages/papi/papi bit
--with-papi-lib="-lpapi64 -lpmapi"

4.1.2 Vampir

A pre-compiled binary of Vampir 5.0 for AIX is available for download from the Vampir website. NB: this download is a demonstration copy only, and a permanent Vampir 5.0 installation is at present unavailable to users on HPCx. Vampir 5.0 is a GUI-based product and it is therefore intended that users provide their own copy of Vampir 5.0, installed on their own platforms, which can then be used to view tracefiles of parallel runs from HPCx locally. However, a fully featured permanent copy of Vampir 4.3 is installed on HPCx. Users should also note that previous versions of Vampir cannot read tracefiles obtained from VampirTrace 5.0, as they are incompatible with the new OTF (Open Trace Format).

4.2 Tracing the Application Code on HPCx

In order to use the VampirTrace libraries:

a) calls to switch VampirTrace on/off are made from the source code (optional);
b) the code must be relinked to the VT libraries;
c) the code is then run (in the normal way, under poe) on HPCx.

4.2.1 Automatic Instrumentation

Automatic instrumentation is the most convenient way to instrument your application. Simply use the special VT compiler wrappers, found in the $VAMPIRTRACE_HOME/bin subdirectory, without any parameters, e.g.:

vtf90 prog1.f90 prog2.f90 -o prog

In this case the appropriate VT libraries will automatically be linked into the executable and tracing will be applied to the whole executable.

4.2.2 Manual Instrumentation using the VampirTrace API

The VT_USER_START and VT_USER_END instrumentation calls can be used to mark any user-defined sequence of statements.

Fortran:

#include "vt_user.inc"

VT_USER_START( name )
   ...
VT_USER_END( name )

C:

#include "vt_user.h"
VT_USER_START("name");
   ...
VT_USER_END("name");

A unique label should be supplied as name in order to identify the different sections traced. If a block has several exit points (as is often the case for functions), all exit points have to be instrumented with VT_USER_END. The code can then be compiled using the VT compiler wrappers (e.g. vtf90, vtcc) as described above. This approach is particularly advantageous if users wish to profile certain sections of the application code and leave other parts untraced. A selective tracing approach can also reduce the size of the resulting tracefiles considerably, which in turn can speed up loading times when it comes to analysing them in Vampir.

4.2.3 Running the application with tracing on HPCx

The code can then be run in the usual way on HPCx, using poe through a LoadLeveler script. Upon completion a series of tracefiles is produced: a numbered *.filt and *.events.z file for each process used, and a global *.def.z and *.otf file.

4.2.4 Hardware Event Counter Monitoring with PAPI

In order to direct VampirTrace to collect hardware event counter data, a $VT_METRICS environment variable must be set in the LoadLeveler job command script, specifying which counters should be monitored. A list of all counters supported by the Performance Application Programming Interface (PAPI) [16] on HPCx can be generated by running the tool papi_avail in the /usr/local/packages/papi/papi bit/share/papi/utils/ directory; a full list is included in this report in Appendix A. Many useful performance metrics are available for analysis, including floating point instruction rates, integer instruction rates, L1, L2 and L3 cache usage statistics, and processor load/store instruction rates.
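As an illustration, the fragment below sketches how the counter selection and program launch might appear in a LoadLeveler script. It is a minimal sketch only: the LoadLeveler keywords and resource values are generic placeholders, the executable name ./prog is assumed, and PAPI_FP_OPS is simply the counter used in the examples later in this report.

#!/bin/ksh
# Sketch of a LoadLeveler script for a VampirTrace-instrumented run.
# The job keywords and resource values below are illustrative assumptions.
# @ job_type         = parallel
# @ total_tasks      = 8
# @ wall_clock_limit = 00:20:00
# @ queue

# Select the PAPI counter(s) VampirTrace should record (see Appendix A).
export VT_METRICS=PAPI_FP_OPS

# Run the instrumented executable in the normal way under poe.
poe ./prog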

4.3 Analysing DL_POLY VampirTrace files with Vampir

The Vampir analyser can be invoked from the command line and the tracefile loaded through the menu options File -> Open Tracefile. The loading operation can take several minutes if the tracefiles are large.

4.3.1 Vampir Summary Chart

The first analysis window to be generated is the Summary Chart, shown below in Figure 1.

Figure 1. Vampir Summary Chart

The black bar represents the sum of the overall execution time on HPCx. This time is then broken down into three constituent parts: the Application (i.e. computation) time in green, the MPI (i.e. communication) time in red and the VT_API (i.e. tracing overhead) time in blue. These representations are maintained throughout all the different Vampir views described here. From this display users can get an overall impression of the communication/computation ratio in their application code.

4.3.2 Vampir Activity Chart

A useful way of identifying load imbalances between processors is to view the Global Activity Chart under the Global Displays menu. This view, shown in Figure 2, gives a breakdown of the Application / MPI / VT_API ratio for each process involved in the execution of the program. The display below is for an eight-processor DL_POLY run and shows that communication and computation are relatively evenly distributed across the processors, and therefore the load balancing is good.

Figure 2. Vampir Global Activity Chart

4.3.3 Global Timeline View

Figure 3. Vampir Global Timeline View

The Global Timeline gives an overall view of the application's parallel characteristics over the course of the complete tracing interval, in this case the complete runtime. The time interval is measured along the horizontal axis (in minutes here) and the processes are listed vertically. Message passing between processes is represented by the black (point-to-point) and purple (global communication operations) lines that link the process timelines. From the prevalence of purple in this graphical representation it appears that communication in DL_POLY is mainly global; however, this can be somewhat misleading, as the purple messages overlay and obscure the black lines at this rather coarse zoom level. The proliferation of red MPI operations in the central part of the timeline could also lead viewers to conclude that the code is highly communication intensive. However, the above test run uses far fewer timesteps than a production run, and approximately the first two-thirds of the global timeline represents a set-up phase that in reality would be substantially less significant.

4.4 Analysing parallel 3D FFT performance in DL_POLY

Figure 4 shows how, by zooming in (left click with the mouse) on the right-hand portion of the Global Timeline, we can obtain a more representative view of the run. This shows a series of timesteps which include phases of computation (green) separated by a series of global communications at the beginning and the end of each timestep. Here the 3D FFTs, signified by black and red areas around the middle of each timestep, can just begin to be distinguished.

Figure 4. DL_POLY Timesteps in the Global Timeline View

By now selecting Global Displays -> Counter Timeline, the hardware counters selected via the $VT_METRICS environment variable can be viewed on the same scale (Figure 5). Here we have chosen to run the code with $VT_METRICS=PAPI_FP_OPS set in the LoadLeveler script, thereby measuring floating point operations throughout the application.

Figure 5. Vampir Hardware Counter Timeline view of DL_POLY timesteps

It can be seen that the flop rate peaks at around 100 Mflop/s per processor towards the centre of a timestep and reduces to virtually zero during the intensive global communication phases at the end of the timestep. Zooming in further (Figure 6), we can identify the routine in which the flop rate is at a maximum, parallel_fft (the number after the function name in the display represents the number of times that the function has been called). The associated Counter Timeline is also shown below.

Figure 6. Parallel 3D FFT in DL_POLY Timelines

The characteristic communication pattern for a 3D FFT is shown clearly in Figure 6, i.e. pairwise point-to-point communications in firstly the x, then the y, then the z direction. Again, the corresponding counter timeline shows how the flop/s rate reduces to almost zero during communication-dominated periods, while serial performance peaks at around 100 Mflop/s during the FFT computation. A summary of the message passing statistics, highlighting the level of data transfer between processors, can also be obtained (Figure 7). This shows how each processor transfers 8 Mbytes of data with three other processors, representing pair-wise communication in the x, y and z directions.

Figure 7. Message Passing statistics for the 3D FFT

Left-clicking on any of the black point-to-point message lines in the 3D FFT timeline highlights the specified message and brings up a pop-up box with more details on this message-passing instance. Shown in Figure 8 are the details of the message highlighted at the bottom right corner of the timeline in Figure 6.

Figure 8. Individual Message Statistics
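To make the 3D FFT communication pattern described above concrete, the fragment below is a minimal Fortran sketch of the kind of pairwise exchange that produces such a signature: each process swaps a block of data with a single partner along one axis using MPI_Sendrecv, and the same step is repeated with a different partner for each of the other two axes. It is an illustration only; the routine and variable names (exchange_along_axis, partner, sendbuf, recvbuf) are invented for this sketch and do not correspond to the actual DL_POLY source.

! Sketch (not DL_POLY source) of a pairwise exchange along one axis of a
! 3D FFT data transpose: each process swaps n double-precision words with
! one partner process in a single, deadlock-free MPI_Sendrecv call.
subroutine exchange_along_axis(partner, sendbuf, recvbuf, n, comm)
  implicit none
  include 'mpif.h'
  integer, intent(in) :: partner, n, comm
  double precision, intent(in)  :: sendbuf(n)
  double precision, intent(out) :: recvbuf(n)
  integer :: ierr, status(MPI_STATUS_SIZE)

  call MPI_Sendrecv(sendbuf, n, MPI_DOUBLE_PRECISION, partner, 0, &
                    recvbuf, n, MPI_DOUBLE_PRECISION, partner, 0, &
                    comm, status, ierr)
end subroutine exchange_along_axis

Calling such a routine once each for the x, y and z partners reproduces the three pairwise transfers per process that are visible in the message statistics of Figure 7.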

4.5 Profiling the NEMO application on large process counts using Vampir on HPCx

An immediate drawback of VampirTrace when using large numbers of processes is the size of the trace files produced and, consequently, the amount of memory (and time) needed by Vampir when loading them. This may be alleviated by reducing the length of the benchmarking run itself (e.g. the number of timesteps that are requested), but ultimately it may be necessary to manually instrument the source code (as described in Section 4.2.2) so that data is only collected for the sections of the code that are of interest. For instance, the scaling performance of a code will not be affected by the performance of any start-up and initialisation routines and yet, for a small benchmarking run, these may take a significant fraction of the runtime. Below we show an example of a summary activity timeline generated by Vampir using a trace file from a manually-instrumented version of the NEMO source code. The large blue areas signify time when the code was not in any of the instrumented regions and are broken only by some initialisation and a region where tracing was (programmatically) switched on for a few timesteps midway through the run before being switched off again.

Figure 9. The activity timeline generated from a manually-instrumented version of NEMO. It contains a little initialisation and then data for a few time-steps midway through the run.
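A minimal sketch of this style of selective instrumentation is shown below. It uses only the VT_USER_START/VT_USER_END calls introduced in Section 4.2.2, guarded by a test on the timestep counter so that a named region is recorded for a small window of steps in the middle of the run. The step count, the chosen window and the dummy "work" are illustrative assumptions and not the actual NEMO code.

#include "vt_user.inc"
program trace_window
   ! Sketch only: enter a VampirTrace user region for timesteps 500-504
   ! so that detailed trace data is recorded for just that window.
   implicit none
   integer, parameter :: nsteps = 1000
   integer :: istep
   real    :: x

   x = 0.0
   do istep = 1, nsteps
      if (istep == 500) then
         VT_USER_START('traced_steps')
      end if

      x = x + sin(real(istep))     ! stands in for the real timestep work

      if (istep == 504) then
         VT_USER_END('traced_steps')
      end if
   end do
   print *, x
end program trace_window

When compiled with the vtf90 wrapper (the #include line means the file must be passed through the C preprocessor, e.g. by giving it an upper-case .F90 suffix), only the marked window appears as a user region in the resulting trace, keeping the tracefile small.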

The full trace data for the few timesteps may be loaded by selecting the relevant region from the summary timeline. Since the tracing has been programmatically switched on for a set number of timesteps, the information provided by the resulting summary may be reliably compared between different runs, since it does not depend on the area of the activity timeline selected by the user. Below we show an example of such a summary where the code has been manually instrumented.

Figure 10. Summary view of trace data for five timesteps of a manually instrumented version of NEMO running on 128 processes.

Once the trace data has been loaded, the user often wishes to view the global timeline, an example of which is shown below for a single timestep of NEMO. A useful summary of this view may be obtained by right-clicking and selecting Components -> Parallelism Display. This brings up the display visible at the bottom of the figure, from which it is easy to determine which sections of the timestep are dominated by, e.g., MPI communications (coloured red by default in Vampir). An example here is the section of NEMO dealing with ice transport processes (coloured bright green). Also of note in this example is the dominance of global communications (coloured purple) over the last 16 processes. It turns out that these processes have been allocated the region of the globe in the vicinity of the poles and thus have extra work to do in removing noise introduced by the small mesh size in this region.

Figure 11. A global timeline for each of 64 processors during a single timestep of NEMO. A 'parallelism' display is included at the bottom showing the percentage of the processors involved in each activity at any one time.

The usefulness of the global timeline can be limited when looking at tracefiles for numbers of processors greater than 64, as Vampir will try to scale the data for each process so as to fit them all on screen. However, one can specify the number of process timelines to be displayed at a time by right-clicking on the display and selecting Options -> Show Subset... This brings up the Show Subset Dialog:

Figure 12. The Show Subset Dialog for the global timeline. Use this to choose the number of processors ('Bars') for which data is shown on the timeline.

Using this dialog one can look at the application's behaviour in detail on a few processes or look at the overall behaviour on many processes. The figure below shows Vampir displaying the activity of the majority of 128 processes during the section of the code dealing with ice rheology in NEMO. The effect of the LPARs on HPCx (effectively 16-way SMP nodes) on inter-process communications is highlighted by the fact that groups of 16 well-synchronised processes may be identified.

Figure 13. The global timeline configured to show data for the majority of the 128 processors of the job.

4.6 Identifying Load Imbalances in the Development of PDSYEVR

Profiling early versions of the new ScaLAPACK routine PDSYEVR with VampirTrace (VT) allows us to investigate its performance in detail. Basic timing analysis of the code revealed that load-balancing problems may exist for certain datasets in the eigenvector calculation stage of the underlying tridiagonal eigensolver MRRR. The Vampir analyses shown below enabled us to track this potential inefficiency with great precision.

In order to track the code in more detail, different colours were assigned to different functions here, using the syntax described in $VT_HOME/info/GROUPS.SPEC. Some additions to the underlying source code are required and a re-compilation must be undertaken. In the timeline view shown in Figure 14, the cyan areas represent computation in the subroutine DLARRV. This routine is involved in the calculation of eigenvectors. As usual, time spent in communication is represented by the red areas in the timeline, and the purple lines represent individual messages passed between processors.

Figure 14. Vampir Timeline for original DLARRV subroutine

The above timeline trace shows that, when calculating half the subset of eigenvalues, the workload in DLARRV increases substantially from process 0 to process 14. This imbalance causes a large communication overhead, represented by the large red areas in the trace. Following this, it was determined that the load imbalance was primarily caused by an unequal division of eigenvectors amongst the processes. These problems were addressed by the ScaLAPACK developers, and a newer version of the code gave a much better division of workload, as can be seen in the timeline traces in Figure 15.

Figure 15. Vampir Timeline for modified DLARRV subroutine

5 PARAVER performance analysis on HPCx

5.1 Setting up Paraver Tracing on HPCx

Paraver uses the tool OMPItrace to generate tracefiles for OpenMP programs, MPI programs, or mixed-mode OpenMP and MPI programs. Users should note that OMPItrace currently only works with 32-bit executables on HPCx, and also that OMPItrace uses IBM's DPCL (Dynamic Probe Class Library), which requires a .rhosts file in your home directory listing all the processor ids on HPCx. Paraver tracefiles are generated on HPCx by adding the environment variables (in e.g. ksh/bash):

export OMPITRACE_HOME=/usr/local/packages/paraver
export MPTRACE_COUNTGROUP=60

to the LoadLeveler job control script. The poe command in the LoadLeveler script is then changed from, e.g.:

poe ./prog

to:

$OMPITRACE_HOME/bin/ompitrace -counters -v poe.real ./prog
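Putting these pieces together, a LoadLeveler script for a Paraver-traced run might look something like the sketch below. The LoadLeveler keywords, resource values and the executable name ./prog are illustrative assumptions; only the two exports and the ompitrace invocation follow the settings described in this section.

#!/bin/ksh
# Sketch of a LoadLeveler script for an OMPItrace/Paraver run.
# Job keywords and resource values are illustrative assumptions.
# @ job_type         = parallel
# @ total_tasks      = 8
# @ wall_clock_limit = 00:20:00
# @ queue

export OMPITRACE_HOME=/usr/local/packages/paraver
export MPTRACE_COUNTGROUP=60

# Call poe.real directly through the ompitrace wrapper (see note below).
$OMPITRACE_HOME/bin/ompitrace -counters -v poe.real ./prog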

On HPCx, poe is in fact a wrapper around the real poe command; in order for OMPItrace to function correctly on HPCx, poe.real must be called directly.

5.2 Viewing Paraver tracefiles on HPCx

The following environment variables should be set in the user's login session:

export PARAVER_HOME=/usr/local/packages/paraver
export MPTRACE_COUNTGROUP=60

During the run, Paraver will have created a temporary trace file for each process (*.mpit and *.sim files). After the run has completed, the user must pack these individual profile files into one global output. This is done by issuing the command:

$PARAVER_HOME/bin/ompi2prv *.mpit -s *.sym -o trace_prm.prv

To view the resulting tracefile use the command:

$PARAVER_HOME/bin/paraver trace_prm.prv

5.3 Analysing the LUS2 application using Paraver

Unlike Vampir, upon starting Paraver users are immediately shown the Global Timeline view. The parallelisation of LUS2 is based on OpenMP, so threads rather than processes are listed on the vertical axis, against time on the horizontal axis. Zooming in on a representative section of the trace shows the following (Figure 16):

Figure 16. Paraver Timeline for two cycles of $OMP PARALLEL DO

The default colours assigned represent the following activities:

Figure 17. Colour properties in Paraver

The trace in Figure 16 shows a typical slice of the timeline from LUS2, where the code is executing the $OMP PARALLEL DO construct across the matrix, as described in Section 3.4. It can be seen that relatively large swathes of blue, representing computation, are divided by thread administration tasks at the start and end of each $OMP PARALLEL DO cycle.

Figure 18. Detailed view of OMP thread scheduling in LUS2

In Figure 18, above the timeline bar of each thread is a series of green flags, each denoting a change of state in the thread. Clicking on a flag gives a detailed description, as shown in the example above. Here it can be seen that thread 16 first undergoes a global synchronisation before being scheduled to run the next cycle of the loop.

6 Summary

Profilers can be highly effective tools in the analysis of parallel programs on HPC architectures. They are particularly useful for identifying and measuring the effect of such problems as communication bottlenecks and load imbalances on the efficiency of codes. New versions of these tools can also record hardware performance data, which facilitates detailed analysis of serial processor performance within a parallel run. The Vampir and Paraver GUI-based analysis tools allow users to switch with ease from global analyses of the parallel run to very detailed analyses of specific messages, all within the one profiling session. Interoperability of VampirTrace with other profilers such as KOJAK and TAU has now been made possible by the adoption of the Open Trace Format.

Acknowledgements

The authors would like to thank Matthias Jurenz from TU Dresden, Chris Johnson from EPCC, University of Edinburgh, and Ilian Todorov and Ian Bush from STFC Daresbury Laboratory for their help in creating this report.

7 References

[1] Vampir - Performance Optimization.
[2] VampirTrace, ZIH, Technische Universität Dresden.
[3] Paraver, The European Center for Parallelism of Barcelona.
[4] The DL_POLY Simulation Package, W. Smith, STFC Daresbury Laboratory.

[5] PDSYEVR. ScaLAPACK's parallel MRRR algorithm for the symmetric eigenvalue problem, D. Antonelli and C. Vömel, LAPACK Working Note 168 (2005).
[6] OMPItrace Tool User's Guide.
[7] The OpenMP Application Program Interface.
[8] NEMO - Nucleus for European Modelling of the Ocean.
[9] KOJAK - Automatic Performance Analysis Toolset, Forschungszentrum Jülich.
[10] TAU - Tuning and Analysis Utilities, University of Oregon.
[11] The European Center for Parallelism of Barcelona.
[12] Science & Technology Facilities Council.
[13] A Parallel Implementation of SPME for DL_POLY 3, I. J. Bush and W. Smith, STFC Daresbury Laboratory.
[14] A Parallel Eigensolver for Dense Symmetric Matrices based on Multiple Relatively Robust Representations, P. Bientinesi, I. S. Dhillon and R. A. van de Geijn, UT CS Technical Report #TR-03026 (2003).
[15] ScaLAPACK.
[16] PAPI - Performance Application Programming Interface.

Appendix A

The list of available PAPI hardware counters on HPCx.

Test case avail.c: Available events and hardware information

Vendor string and code    : IBM (-1)

Model string and code     : POWER5 (8192)
CPU Revision              :
CPU Megahertz             :
CPU's in this Node        : 16
Nodes in this System      : 1
Total CPU's               : 16
Number Hardware Counters  : 6
Max Multiplex Counters    :

Name            Avail  Deriv  Description
PAPI_L1_DCM     Yes    Yes    Level 1 data cache misses
PAPI_L1_ICM     No     No     Level 1 instruction cache misses
PAPI_L2_DCM     Yes    No     Level 2 data cache misses
PAPI_L2_ICM     Yes    No     Level 2 instruction cache misses
PAPI_L3_DCM     Yes    Yes    Level 3 data cache misses
PAPI_L3_ICM     Yes    Yes    Level 3 instruction cache misses
PAPI_L1_TCM     No     No     Level 1 cache misses
PAPI_L2_TCM     No     No     Level 2 cache misses
PAPI_L3_TCM     No     No     Level 3 cache misses
PAPI_CA_SNP     No     No     Requests for a snoop
PAPI_CA_SHR     No     No     Requests for exclusive access to shared cache line
PAPI_CA_CLN     No     No     Requests for exclusive access to clean cache line
PAPI_CA_INV     No     No     Requests for cache line invalidation
PAPI_CA_ITV     No     No     Requests for cache line intervention
PAPI_L3_LDM     Yes    Yes    Level 3 load misses
PAPI_L3_STM     No     No     Level 3 store misses
PAPI_BRU_IDL    No     No     Cycles branch units are idle
PAPI_FXU_IDL    Yes    No     Cycles integer units are idle
PAPI_FPU_IDL    No     No     Cycles floating point units are idle
PAPI_LSU_IDL    No     No     Cycles load/store units are idle
PAPI_TLB_DM     Yes    No     Data translation lookaside buffer misses
PAPI_TLB_IM     Yes    No     Instruction translation lookaside buffer misses
PAPI_TLB_TL     Yes    Yes    Total translation lookaside buffer misses
PAPI_L1_LDM     Yes    No     Level 1 load misses
PAPI_L1_STM     Yes    No     Level 1 store misses
PAPI_L2_LDM     Yes    No     Level 2 load misses
PAPI_L2_STM     No     No     Level 2 store misses
PAPI_BTAC_M     No     No     Branch target address cache misses
PAPI_PRF_DM     No     No     Data prefetch cache misses
PAPI_L3_DCH     No     No     Level 3 data cache hits
PAPI_TLB_SD     No     No     Translation lookaside buffer shootdowns
PAPI_CSR_FAL    No     No     Failed store conditional instructions
PAPI_CSR_SUC    No     No     Successful store conditional instructions
PAPI_CSR_TOT    No     No     Total store conditional instructions
PAPI_MEM_SCY    No     No     Cycles Stalled Waiting for memory accesses
PAPI_MEM_RCY    No     No     Cycles Stalled Waiting for memory Reads
PAPI_MEM_WCY    No     No     Cycles Stalled Waiting for memory writes
PAPI_STL_ICY    Yes    No     Cycles with no instruction issue
PAPI_FUL_ICY    No     No     Cycles with maximum instruction issue
PAPI_STL_CCY    No     No     Cycles with no instructions completed
PAPI_FUL_CCY    No     No     Cycles with maximum instructions completed
PAPI_HW_INT     Yes    No     Hardware interrupts
PAPI_BR_UCN     No     No     Unconditional branch instructions
PAPI_BR_CN      No     No     Conditional branch instructions
PAPI_BR_TKN     No     No     Conditional branch instructions taken
PAPI_BR_NTK     No     No     Conditional branch instructions not taken
PAPI_BR_MSP     Yes    Yes    Conditional branch instructions mispredicted
PAPI_BR_PRC     No     No     Conditional branch instructions correctly predicted
PAPI_FMA_INS    Yes    No     FMA instructions completed
PAPI_TOT_IIS    Yes    No     Instructions issued
PAPI_TOT_INS    Yes    No     Instructions completed
PAPI_INT_INS    Yes    No     Integer instructions
PAPI_FP_INS     Yes    No     Floating point instructions
PAPI_LD_INS     Yes    No     Load instructions
PAPI_SR_INS     Yes    No     Store instructions
PAPI_BR_INS     Yes    No     Branch instructions
PAPI_VEC_INS    No     No     Vector/SIMD instructions
PAPI_RES_STL    No     No     Cycles stalled on any resource
PAPI_FP_STAL    No     No     Cycles the FP unit(s) are stalled
PAPI_TOT_CYC    Yes    No     Total cycles
PAPI_LST_INS    Yes    Yes    Load/store instructions completed
PAPI_SYC_INS    No     No     Synchronization instructions completed
PAPI_L1_DCH     No     No     Level 1 data cache hits
PAPI_L2_DCH     No     No     Level 2 data cache hits
PAPI_L1_DCA     Yes    Yes    Level 1 data cache accesses
PAPI_L2_DCA     No     No     Level 2 data cache accesses
PAPI_L3_DCA     No     No     Level 3 data cache accesses
PAPI_L1_DCR     Yes    No     Level 1 data cache reads
PAPI_L2_DCR     No     No     Level 2 data cache reads
PAPI_L3_DCR     Yes    No     Level 3 data cache reads
PAPI_L1_DCW     Yes    No     Level 1 data cache writes
PAPI_L2_DCW     No     No     Level 2 data cache writes
PAPI_L3_DCW     No     No     Level 3 data cache writes
PAPI_L1_ICH     Yes    No     Level 1 instruction cache hits
PAPI_L2_ICH     No     No     Level 2 instruction cache hits
PAPI_L3_ICH     No     No     Level 3 instruction cache hits
PAPI_L1_ICA     No     No     Level 1 instruction cache accesses
PAPI_L2_ICA     No     No     Level 2 instruction cache accesses
PAPI_L3_ICA     Yes    No     Level 3 instruction cache accesses
PAPI_L1_ICR     No     No     Level 1 instruction cache reads
PAPI_L2_ICR     No     No     Level 2 instruction cache reads
PAPI_L3_ICR     No     No     Level 3 instruction cache reads
PAPI_L1_ICW     No     No     Level 1 instruction cache writes
PAPI_L2_ICW     No     No     Level 2 instruction cache writes
PAPI_L3_ICW     No     No     Level 3 instruction cache writes
PAPI_L1_TCH     No     No     Level 1 total cache hits
PAPI_L2_TCH     No     No     Level 2 total cache hits
PAPI_L3_TCH     No     No     Level 3 total cache hits
PAPI_L1_TCA     No     No     Level 1 total cache accesses
PAPI_L2_TCA     No     No     Level 2 total cache accesses
PAPI_L3_TCA     No     No     Level 3 total cache accesses
PAPI_L1_TCR     No     No     Level 1 total cache reads
PAPI_L2_TCR     No     No     Level 2 total cache reads
PAPI_L3_TCR     No     No     Level 3 total cache reads
PAPI_L1_TCW     No     No     Level 1 total cache writes
PAPI_L2_TCW     No     No     Level 2 total cache writes
PAPI_L3_TCW     No     No     Level 3 total cache writes
PAPI_FML_INS    No     No     Floating point multiply instructions
PAPI_FAD_INS    No     No     Floating point add instructions
PAPI_FDV_INS    Yes    No     Floating point divide instructions
PAPI_FSQ_INS    Yes    No     Floating point square root instructions
PAPI_FNV_INS    No     No     Floating point inverse instructions
PAPI_FP_OPS     Yes    Yes    Floating point operations

avail.c PASSED


More information

Prof. Thomas Sterling

Prof. Thomas Sterling High Performance Computing: Concepts, Methods & Means Performance Measurement 1 Prof. Thomas Sterling Department of Computer Science Louisiana i State t University it February 13 th, 2007 News Alert! Intel

More information

Profiling: Understand Your Application

Profiling: Understand Your Application Profiling: Understand Your Application Michal Merta michal.merta@vsb.cz 1st of March 2018 Agenda Hardware events based sampling Some fundamental bottlenecks Overview of profiling tools perf tools Intel

More information

Performance Analysis with Vampir

Performance Analysis with Vampir Performance Analysis with Vampir Ronald Geisler, Holger Brunst, Bert Wesarg, Matthias Weber, Hartmut Mix, Ronny Tschüter, Robert Dietrich, and Andreas Knüpfer Technische Universität Dresden Outline Part

More information

Evaluation of Profiling Tools for the Acquisition of Time Independent Traces

Evaluation of Profiling Tools for the Acquisition of Time Independent Traces Evaluation of Profiling Tools for the Acquisition of Time Independent Traces Frédéric Desprez, George S. Markomanolis, Frédéric Suter TECHNICAL REPORT N 437 July 2013 Project-Team AVALON ISSN 0249-0803

More information

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication

More information

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions Data In single-program multiple-data (SPMD) parallel programs, global data is partitioned, with a portion of the data assigned to each processing node. Issues relevant to choosing a partitioning strategy

More information

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008 SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem

More information

MPI Performance Tools

MPI Performance Tools Physics 244 31 May 2012 Outline 1 Introduction 2 Timing functions: MPI Wtime,etime,gettimeofday 3 Profiling tools time: gprof,tau hardware counters: PAPI,PerfSuite,TAU MPI communication: IPM,TAU 4 MPI

More information

LAPACK. Linear Algebra PACKage. Janice Giudice David Knezevic 1

LAPACK. Linear Algebra PACKage. Janice Giudice David Knezevic 1 LAPACK Linear Algebra PACKage 1 Janice Giudice David Knezevic 1 Motivating Question Recalling from last week... Level 1 BLAS: vectors ops Level 2 BLAS: matrix-vectors ops 2 2 O( n ) flops on O( n ) data

More information

VAMPIR & VAMPIRTRACE Hands On

VAMPIR & VAMPIRTRACE Hands On VAMPIR & VAMPIRTRACE Hands On PRACE Spring School 2012 in Krakow May, 2012 Holger Brunst Slides by: Andreas Knüpfer, Jens Doleschal, ZIH, Technische Universität Dresden Hands-on: NPB Build Copy NPB sources

More information

Batch Jobs Performance Testing

Batch Jobs Performance Testing Batch Jobs Performance Testing October 20, 2012 Author Rajesh Kurapati Introduction Batch Job A batch job is a scheduled program that runs without user intervention. Corporations use batch jobs to automate

More information

Analyzing I/O Performance on a NEXTGenIO Class System

Analyzing I/O Performance on a NEXTGenIO Class System Analyzing I/O Performance on a NEXTGenIO Class System holger.brunst@tu-dresden.de ZIH, Technische Universität Dresden LUG17, Indiana University, June 2 nd 2017 NEXTGenIO Fact Sheet Project Research & Innovation

More information

Vampir 9 User Manual

Vampir 9 User Manual Vampir 9 User Manual Copyright c 2018 GWT-TUD GmbH Freiberger Str. 33 01067 Dresden, Germany http://gwtonline.de Support / Feedback / Bug Reports Please provide us feedback! We are very interested to hear

More information

( ZIH ) Center for Information Services and High Performance Computing. Event Tracing and Visualization for Cell Broadband Engine Systems

( ZIH ) Center for Information Services and High Performance Computing. Event Tracing and Visualization for Cell Broadband Engine Systems ( ZIH ) Center for Information Services and High Performance Computing Event Tracing and Visualization for Cell Broadband Engine Systems ( daniel.hackenberg@zih.tu-dresden.de ) Daniel Hackenberg Cell Broadband

More information

Overview. Timers. Profilers. HPM Toolkit

Overview. Timers. Profilers. HPM Toolkit Overview Timers Profilers HPM Toolkit 2 Timers Wide range of timers available on the HPCx system Varying precision portability language ease of use 3 Timers Timer Usage Wallclock/C PU Resolution Language

More information

Using VTK and the OpenGL Graphics Libraries on HPCx

Using VTK and the OpenGL Graphics Libraries on HPCx Using VTK and the OpenGL Graphics Libraries on HPCx Jeremy Nowell EPCC The University of Edinburgh Edinburgh EH9 3JZ Scotland, UK April 29, 2005 Abstract Some of the graphics libraries and visualisation

More information

Performance Analysis of the MPAS-Ocean Code using HPCToolkit and MIAMI

Performance Analysis of the MPAS-Ocean Code using HPCToolkit and MIAMI Performance Analysis of the MPAS-Ocean Code using HPCToolkit and MIAMI Gabriel Marin February 11, 2014 MPAS-Ocean [4] is a component of the MPAS framework of climate models. MPAS-Ocean is an unstructured-mesh

More information

Hybrid Programming with MPI and SMPSs

Hybrid Programming with MPI and SMPSs Hybrid Programming with MPI and SMPSs Apostolou Evangelos August 24, 2012 MSc in High Performance Computing The University of Edinburgh Year of Presentation: 2012 Abstract Multicore processors prevail

More information

The PAPI Cross-Platform Interface to Hardware Performance Counters

The PAPI Cross-Platform Interface to Hardware Performance Counters The PAPI Cross-Platform Interface to Hardware Performance Counters Kevin London, Shirley Moore, Philip Mucci, and Keith Seymour University of Tennessee-Knoxville {london, shirley, mucci, seymour}@cs.utk.edu

More information

Performance Analysis for Large Scale Simulation Codes with Periscope

Performance Analysis for Large Scale Simulation Codes with Periscope Performance Analysis for Large Scale Simulation Codes with Periscope M. Gerndt, Y. Oleynik, C. Pospiech, D. Gudu Technische Universität München IBM Deutschland GmbH May 2011 Outline Motivation Periscope

More information

PCAN-Explorer 6. Tel: Professional Windows Software to Communicate with CAN and CAN FD Busses. Software >> PC Software

PCAN-Explorer 6. Tel: Professional Windows Software to Communicate with CAN and CAN FD Busses. Software >> PC Software PCAN-Explorer 6 Professional Windows Software to Communicate with CAN and CAN FD Busses The PCAN-Explorer 6 is a versatile, professional program for working with CAN and CAN FD networks. The user is not

More information

!OMP #pragma opm _OPENMP

!OMP #pragma opm _OPENMP Advanced OpenMP Lecture 12: Tips, tricks and gotchas Directives Mistyping the sentinel (e.g.!omp or #pragma opm ) typically raises no error message. Be careful! The macro _OPENMP is defined if code is

More information

NEXTGenIO Performance Tools for In-Memory I/O

NEXTGenIO Performance Tools for In-Memory I/O NEXTGenIO Performance Tools for In- I/O holger.brunst@tu-dresden.de ZIH, Technische Universität Dresden 22 nd -23 rd March 2017 Credits Intro slides by Adrian Jackson (EPCC) A new hierarchy New non-volatile

More information

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,

More information

Performance Analysis with Vampir. Joseph Schuchart ZIH, Technische Universität Dresden

Performance Analysis with Vampir. Joseph Schuchart ZIH, Technische Universität Dresden Performance Analysis with Vampir Joseph Schuchart ZIH, Technische Universität Dresden 1 Mission Visualization of dynamics of complex parallel processes Full details for arbitrary temporal and spatial levels

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,

More information

Dynamic Selection of Auto-tuned Kernels to the Numerical Libraries in the DOE ACTS Collection

Dynamic Selection of Auto-tuned Kernels to the Numerical Libraries in the DOE ACTS Collection Numerical Libraries in the DOE ACTS Collection The DOE ACTS Collection SIAM Parallel Processing for Scientific Computing, Savannah, Georgia Feb 15, 2012 Tony Drummond Computational Research Division Lawrence

More information

Using Java for Scientific Computing. Mark Bul EPCC, University of Edinburgh

Using Java for Scientific Computing. Mark Bul EPCC, University of Edinburgh Using Java for Scientific Computing Mark Bul EPCC, University of Edinburgh markb@epcc.ed.ac.uk Java and Scientific Computing? Benefits of Java for Scientific Computing Portability Network centricity Software

More information

Parallel Performance and Optimization

Parallel Performance and Optimization Parallel Performance and Optimization Gregory G. Howes Department of Physics and Astronomy University of Iowa Iowa High Performance Computing Summer School University of Iowa Iowa City, Iowa 25-26 August

More information

Parallel Performance Analysis Using the Paraver Toolkit

Parallel Performance Analysis Using the Paraver Toolkit Parallel Performance Analysis Using the Paraver Toolkit Parallel Performance Analysis Using the Paraver Toolkit [16a] [16a] Slide 1 University of Stuttgart High-Performance Computing Center Stuttgart (HLRS)

More information

Integrating Parallel Application Development with Performance Analysis in Periscope

Integrating Parallel Application Development with Performance Analysis in Periscope Technische Universität München Integrating Parallel Application Development with Performance Analysis in Periscope V. Petkov, M. Gerndt Technische Universität München 19 April 2010 Atlanta, GA, USA Motivation

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

VAMPIR & VAMPIRTRACE Hands On

VAMPIR & VAMPIRTRACE Hands On VAMPIR & VAMPIRTRACE Hands On 8th VI-HPS Tuning Workshop at RWTH Aachen September, 2011 Tobias Hilbrich and Joachim Protze Slides by: Andreas Knüpfer, Jens Doleschal, ZIH, Technische Universität Dresden

More information

Mixed Mode MPI / OpenMP Programming

Mixed Mode MPI / OpenMP Programming Mixed Mode MPI / OpenMP Programming L.A. Smith Edinburgh Parallel Computing Centre, Edinburgh, EH9 3JZ 1 Introduction Shared memory architectures are gradually becoming more prominent in the HPC market,

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

Improving Applica/on Performance Using the TAU Performance System

Improving Applica/on Performance Using the TAU Performance System Improving Applica/on Performance Using the TAU Performance System Sameer Shende, John C. Linford {sameer, jlinford}@paratools.com ParaTools, Inc and University of Oregon. April 4-5, 2013, CG1, NCAR, UCAR

More information

ISC 09 Poster Abstract : I/O Performance Analysis for the Petascale Simulation Code FLASH

ISC 09 Poster Abstract : I/O Performance Analysis for the Petascale Simulation Code FLASH ISC 09 Poster Abstract : I/O Performance Analysis for the Petascale Simulation Code FLASH Heike Jagode, Shirley Moore, Dan Terpstra, Jack Dongarra The University of Tennessee, USA [jagode shirley terpstra

More information

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery

More information

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured

More information