Profiling Parallel Performance using Vampir and Paraver
Andrew Sunderland, Andrew Porter
STFC Daresbury Laboratory, Warrington, WA4 4AD

Abstract

Two popular parallel profiling tools installed on HPCx are Vampir and Paraver, which are also widely available on other platforms. These tools can simultaneously monitor hardware counters and track message-passing calls, providing valuable information on an application's runtime behaviour which can be used to improve its performance. In this report we look at using these tools in practice on a number of different applications on HPCx, with the aim of showing users how to utilise such profilers to help them understand the behaviour of their own codes. As part of this, we also examine the use of Vampir for codes run on large numbers of processes (64 or more). Interested parties should check back here regularly for updates to this paper.

This is a Technical Report from the HPCx Consortium. HPCx UoE Ltd 2007. Neither HPCx UoE Ltd nor its members separately accept any responsibility for loss or damage arising from the use of information contained in any of their reports or in any communication about their tests or investigations.
Contents

1 Introduction
2 Background to Profilers
   2.1 VAMPIR & VAMPIRTRACE
       Product History
   2.2 PARAVER
3 Background to Applications
   3.1 DL_POLY 3
   3.2 NEMO
   3.3 PDSYEVR
   3.4 LU decomposition using OpenMP
4 VAMPIR Performance Analysis on HPCx
   4.1 Installation
       VampirTrace
       Vampir
   4.2 Tracing the Application Code on HPCx
       4.2.1 Automatic Instrumentation
       4.2.2 Manual Instrumentation using the VampirTrace API
       4.2.3 Running the application with tracing on HPCx
       4.2.4 Hardware Event Counter Monitoring with PAPI
   4.3 Analysing DL_POLY VampirTrace files with Vampir
       4.3.1 Vampir Summary Chart
       4.3.2 Vampir Activity Chart
       4.3.3 Global Timeline View
   4.4 Analysing parallel 3D FFT performance in DL_POLY
   4.5 Profiling the NEMO application on large process counts using Vampir on HPCx
   4.6 Identifying Load Imbalances in the Development of PDSYEVR
5 PARAVER performance analysis on HPCx
   5.1 Setting up Paraver Tracing on HPCx
   5.2 Viewing Paraver tracefiles on HPCx
   5.3 Analysing the LUS2 application using Paraver
6 Summary
7 References
Appendix A
1 Introduction

The performance of a parallel code is commonly dependent on a complex combination of factors. It is therefore important that developers of High Performance Computing applications have access to effective tools for collecting and analysing performance data. This data can be used to identify such issues as computational and communication bottlenecks, load imbalances and inefficient CPU utilisation. In this report we investigate the use of Vampir (Visualization and Analysis of MPI Resources) [1], in association with its related tracing tool VampirTrace [2], and Paraver (Parallel Program and Visualization Analysis Tool) [3]. HPCx usage is demonstrated here by applying the tools to the parallel DL_POLY 3 [4] application code, the computational core of a new symmetric parallel eigensolver PDSYEVR [5], an LU decomposition code [6] parallelised using OpenMP [7] and the NEMO ocean-modelling code [8]. It is not intended that this report should be referenced as a user guide for the tools investigated; for this there are excellent documents at the respective tools' websites that detail the huge number of features available. Rather, this report is intended to give users a quick introduction to getting started with the tools on HPCx and to demonstrate, with the aid of application examples, some of the in-depth analysis that can be enabled.

2 Background to Profilers

Both analysis tools involve similar approaches, i.e. analysis of a tracefile created at the application's runtime that contains information on the various calls and events undertaken. For tracing the application code, VampirTrace requires a relinking of the application to the VampirTrace libraries, whereas Paraver-based tracing does not require any relinking of the code, only execution via the OMPItrace tool.
Both VampirTrace and OMPItrace can produce a tracefile for an OpenMP program, an MPI program, or a mixed-mode OpenMP and MPI program. Both tools require licenses, and environment variable settings can be used to customise the tracing events that are to be recorded.

2.1 VAMPIR & VAMPIRTRACE

Vampir (Visualisation and Analysis of MPI Resources) [1] is a commercial post-mortem trace visualisation tool from the Center for Information Services and High Performance Computing (ZIH) of TU Dresden [2]. The freely available VampirTrace, developed in collaboration with the KOJAK project at ZAM/FZ Jülich [9], is obtainable from the same organisation. The tool uses profiling extensions to MPI and permits analysis of the message events where data is passed between processors during execution of a parallel program. Event ordering, message lengths and times can all be analysed. The latest version (5.0) features support for OpenMP events and hardware performance counters. The tool comes in two components, VampirTrace and Vampir. The first of these
includes a library which, when linked and called from a parallel program, produces an event tracefile. Common events include the entering and leaving of function calls and the sending and receiving of MPI messages. By using keywords, application-specific information can be built into the trace using subroutine calls. Trace calls can be automatically applied to the whole run-time or manually added around time-critical program sections; the latter involves adding calls to VT_USER_START('label') and VT_USER_END('label') at the section of interest in the source. Automatic instrumentation requires only a re-link of the application code with the VT libraries, whilst manual instrumentation requires a re-compilation of the program. Vampir itself is then used to convert the trace information into a variety of graphical views, e.g. timeline displays showing state changes and communication, profiling statistics displaying the execution times of routines, communication statistics indicating volumes and transmission rates, and more.

Product History

The Vampir tool has been developed at the Center for Applied Mathematics of Research Center Jülich and the Center for High Performance Computing of the Technische Universität Dresden. Vampir has been available as a commercial product since 1996 and has been enhanced in the scope of many research and development projects. In the past it was distributed by the German Pallas GmbH, which later became part of Intel Corporation; the cooperation with Intel has since ended. Vampir has been widely used in the high performance computing community for many years. A growing number of performance monitoring environments, such as TAU [10] and KOJAK [9], can produce tracefiles that are readable by Vampir. Since the release of version 5.0, Vampir supports the new Open Trace Format (OTF), also developed by ZIH. This trace format is especially designed for massively parallel programs.
Vampir is portable across many computing platforms due to its X-based graphical user interface.

2.2 PARAVER

The Paraver performance analysis tool is developed by the European Center for Parallelism of Barcelona (CEPBA) [11] at the Technical University of Catalonia. Based on an easy-to-use Motif GUI, Paraver has been developed to respond to the need for a qualitative global perception of application behaviour by visual inspection, followed by detailed quantitative analysis of the problems identified. Paraver provides a large amount of information useful for deciding whether and where to invest programming effort to optimise an application.
3 Background to Applications

3.1 DL_POLY 3

DL_POLY [4] is a parallel molecular dynamics simulation package developed at STFC's Daresbury Laboratory [12]. DL_POLY 3 is the most recent version (2001) and exploits a linked-cell algorithm for domain decomposition, suitable for very large systems (up to order 1,000,000 particles) of reasonably uniform density. Computationally, the code is characterised by a series of timestep calculations involving exchanges of short-range forces between particles and long-range forces between domains using 3-dimensional FFTs. The computation of these 3D FFTs [13] is a major expense during the computation. Depending on the general integration flavour, a DL_POLY 3 timestep can be considered to comprise the following stages: integration part 1, particle exchange, halo reconstruction, force evaluation, integration part 2. The most communication-expensive operation is the particle exchange stage, since it involves recovery of the topology of bonded interactions for particles crossing domains. Metal interactions are evaluated using tabulated data and involve a halo exchange of data, as they depend on the local density. The test case examined here is a molecular simulation involving dipalmitoylphosphatidylcholine (DPPC) in water. This system is of interest due to its complex forcefield, containing many bonded interactions, including constraints, as well as van der Waals and Coulomb interactions.

3.2 NEMO

NEMO (Nucleus for European Modelling of the Ocean) [8] is designed for the simulation of both regional and global ocean circulation and is developed at the Laboratoire d'Océanographie Dynamique et de Climatologie at the Institut Pierre Simon Laplace. It solves a primitive-equation model of the ocean system in three dimensions using a finite-difference scheme and contains sea-ice and passive-tracer models.
Originally designed for vector machines, the most recent version uses MPI in its MPP implementation. Here we discuss how Vampir may be used to analyse a code's performance on processor counts of up to 256, using NEMO as an example.

3.3 PDSYEVR

In the 1990s, Dhillon and Parlett devised a new algorithm (Multiple Relatively Robust Representations, MRRR) [14] for computing numerically orthogonal eigenvectors of a symmetric tridiagonal matrix at O(n^2) cost. Recently a ScaLAPACK [15] implementation of this algorithm, named PDSYEVR, has been developed and it is planned that this routine will be incorporated into future releases of ScaLAPACK. Analysis of some of the subroutines from initial versions of this code with Vampir helped identify performance issues on HPCx, which were later rectified by the developers.
3.4 LU decomposition using OpenMP

LUS2 is a short Fortran program that calculates an LU decomposition of a dense matrix. Parallelisation of the LU algorithm is achieved by using OpenMP Fortran interface directives, in particular PARALLEL DO loop directives, as in the construct that loops through the rows and columns of a matrix shown below:

C$OMP PARALLEL DO SCHEDULE(DYNAMIC,16), PRIVATE(j)
      do i=1, ISIZE
         do j=1, ISIZE
            D(i,j) = A(i,j) + B(i,j)
         enddo
      enddo
C$OMP END PARALLEL DO

4 VAMPIR Performance Analysis on HPCx

4.1 Installation

VampirTrace

The source files for VampirTrace can be downloaded free of charge from the VampirTrace website (search for VampirTrace from the home page). In order to install a 64-bit version of VampirTrace on HPCx the following compiler options were used:

./configure AR="ar -X32_64" CC=xlc_r CXX=xlC_r F77=xlf_r FC=xlf90_r \
    MPICC=mpcc_r CFLAGS="-O2 -g -q64" CXXFLAGS="-O2 -g -q64" \
    FFLAGS="-O2 -g -q64" FCFLAGS="-O2 -g -q64"

The following configuration options were also required in order to link to IBM's Parallel Operating Environment (poe) and Message Passing Interface (MPI) library, and to access hardware event counter monitoring via the Performance Application Programming Interface (PAPI):

--with-mpi-inc-dir=/usr/lpp/ppe.poe/include
--with-mpi-lib-dir=/usr/lpp/ppe.poe/lib
--with-mpi-lib=-lmpi_r
--with-papidir=/usr/local/packages/papi/papi bit
--with-papi-lib="-lpapi64 -lpmapi"

Vampir

A pre-compiled binary of Vampir 5.0 for AIX is available for download from the Vampir website. NB: this download is a demonstration copy only, and a permanent Vampir 5.0 installation is at present unavailable to users on HPCx. Vampir 5.0 is a GUI-based product and it is therefore intended that users provide their own copy of Vampir 5.0, installed on their remote platforms; this can then be used to view tracefiles of parallel runs from HPCx locally. However, a fully featured permanent copy of Vampir 4.3 is installed on HPCx. Users should also note that previous versions of Vampir cannot read tracefiles obtained from VampirTrace 5.0, as they are incompatible with the new OTF (Open Trace Format).

4.2 Tracing the Application Code on HPCx

In order to use the VampirTrace libraries:

a) calls to switch VampirTrace on/off are made from the source code (optional);
b) the code must be relinked to the VT libraries;
c) the code is then run (in the normal way under poe) on HPCx.

4.2.1 Automatic Instrumentation

Automatic instrumentation is the most convenient way to instrument your application. Simply use the special VT compiler wrappers, found in the $VAMPIRTRACE_HOME/bin subdirectory, without any parameters, e.g.:

vtf90 prog1.f90 prog2.f90 -o prog

In this case the appropriate VT libraries will automatically be linked into the executable and tracing will be applied to the whole executable.

4.2.2 Manual Instrumentation using the VampirTrace API

The VT_USER_START and VT_USER_END instrumentation calls can be used to mark any user-defined sequence of statements.

Fortran:

#include "vt_user.inc"
VT_USER_START('name')
...
VT_USER_END('name')

C:

#include "vt_user.h"

VT_USER_START("name");
...
VT_USER_END("name");

A unique label should be applied as name in order to identify the different sections traced. If a block has several exit points (as is often the case for functions), all exit points have to be instrumented with VT_USER_END. The code can then be compiled using the VT compiler wrappers (e.g. vtf90, vtcc) as described above. This approach is particularly advantageous if users wish to profile certain sections of the application code and leave other parts untraced. A selective tracing approach can also reduce the size of the resulting tracefiles considerably, which in turn can speed up loading times when analysing them in Vampir.

4.2.3 Running the application with tracing on HPCx

The code can then be run in the usual way on HPCx using poe through a LoadLeveler script. Upon completion a series of tracefiles is produced: a numbered *.filt and *.events.z file for each process used, and a global *.def.z and *.otf file.

4.2.4 Hardware Event Counter Monitoring with PAPI

In order to direct VampirTrace to collect hardware event counter data, the $VT_METRICS environment variable must be set in the LoadLeveler job command script, specifying which counters should be monitored. A list of all counters supported by the Performance Application Programming Interface (PAPI) [16] on HPCx can be generated by running the tool 'papi_avail' in the /usr/local/packages/papi/papi bit/share/papi/utils/ directory. A full list is included in this report in Appendix A. Many useful performance metrics are available for analysis, including floating point instruction rates, integer instruction rates, L1, L2 and L3 cache usage statistics, and processor load/store instruction rates.
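As an illustration of the manual-instrumentation pattern described above, the following self-contained C sketch brackets a time-critical routine with VT_USER_START/VT_USER_END. The sum_of_squares kernel is hypothetical, and the no-op stub macros merely stand in for vt_user.h so the sketch compiles on its own; in a real build the file would be compiled with a VT wrapper (e.g. vtcc) with VTRACE defined.

```c
/* Sketch of bracketing a time-critical section with the VampirTrace
   user API.  The stubs below only make this sketch self-contained;
   a real build uses vt_user.h via the VT compiler wrappers. */
#ifdef VTRACE
#include "vt_user.h"
#else
#define VT_USER_START(name) ((void)0)   /* stub: tracing compiled out */
#define VT_USER_END(name)   ((void)0)
#endif

/* A hypothetical compute kernel whose cost we want to see in Vampir. */
double sum_of_squares(const double *x, int n)
{
    VT_USER_START("sum_of_squares");
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * x[i];
    VT_USER_END("sum_of_squares");   /* every exit point needs this */
    return s;
}
```

Compiled without the VT wrappers the stubs cost nothing, so the same source can move between instrumented and plain builds unchanged.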
4.3 Analysing DL_POLY VampirTrace files with Vampir

The Vampir analyser can be invoked from the command line and the tracefile loaded through the menu options File -> Open Tracefile. The loading operation can take several minutes if the tracefiles are large.

4.3.1 Vampir Summary Chart

The first analysis window to be generated is the Summary Chart, shown below in Figure 1.

Figure 1. Vampir Summary Chart

The black bar represents the sum of the overall execution time on HPCx. This time is then broken down into three constituent parts: the Application (i.e. computation) time in green, the MPI (i.e. communication) time in red and the VT_API (i.e. tracing overhead) time in blue. These representations are maintained throughout all the different Vampir views described here. From this display users can get an overall impression of the communication/computation ratio in their application code.

4.3.2 Vampir Activity Chart

A useful way of identifying load imbalances between processors is to view the Global Activity Chart under the Global Displays menu. This view, shown in Figure 2, gives a
breakdown of the Application/MPI/VT_API ratio for each process involved in the execution of the program. The display below is for an eight-processor DL_POLY run and it shows that communication and computation are relatively evenly distributed across the processors; the load balancing is therefore good.

Figure 2. Vampir Global Activity Chart

4.3.3 Global Timeline View

Figure 3. Vampir Global Timeline View
The Global Timeline gives an overall view of the application's parallel characteristics over the course of the complete tracing interval, in this case the complete runtime. The time interval is measured along the horizontal axis (in minutes here) and the processes are listed vertically. Message passing between processes is represented by the black (point-to-point) and purple (global communication operations) lines that link the process timelines. From the prevalence of purple in the above graphical representation it appears that communication in DL_POLY is mainly global; however, this can be somewhat misleading, as the purple messages overlay and obscure the black lines at this rather coarse zoom level. The proliferation of red MPI operations in the central part of the timeline could lead viewers to conclude that the code is highly communication intensive. However, the above test run has many fewer timesteps than a production run, and approximately the first two-thirds of the global timeline represents a set-up phase that in reality would be substantially less significant.

4.4 Analysing parallel 3D FFT performance in DL_POLY

Figure 4 shows how, by zooming in (left click with the mouse) on the right-hand portion of the Global Timeline, we can obtain a more representative view of the run. This shows a series of timesteps which include phases of computation (green) separated by a series of global communications at the beginning and the end of each timestep. Here the 3D FFTs, signified by black and red areas around the middle of each timestep, can just begin to be distinguished.

Figure 4. DL_POLY Timesteps in the Global Timeline View
By now selecting Global Displays -> Counter Timeline, the selected hardware counters (chosen via the $VT_METRICS environment variable) can be viewed on the same scale (Figure 5). Here we have chosen to run the code with $VT_METRICS=PAPI_FP_OPS set in the LoadLeveler script, thereby measuring floating point operations throughout the application.

Figure 5. Vampir Hardware Counter Timeline view of DL_POLY timesteps

It can be seen that the flop/s rate peaks at around 100 Mflop/s per processor towards the centre of a timestep and reduces to virtually zero during the intensive global communication phases at the end of the timestep. Zooming further in (Figure 6), we can identify the routine in which the flop rate is at a maximum, parallel_fft (the number after the function name in the display represents the number of times that the function has been called). The associated Counter Timeline is also shown below.
Figure 6. Parallel 3D FFT in DL_POLY Timelines

The characteristic communication pattern for a 3D FFT is shown clearly in Figure 6: pairwise point-to-point communications in firstly the x, then the y, then the z direction. Again, the corresponding counter timeline shows how the flop/s rate reduces to almost zero during communication-dominated periods, while serial performance peaks at around 100 Mflop/s during the FFT computation. A summary of the message passing statistics, highlighting the level of data transfer between processors, can also be obtained (Figure 7). This shows how each processor
transfers 8 Mbytes of data with three other processors, representing pairwise communication in the x, y and z directions.

Figure 7. Message Passing statistics for the 3D FFT

Left-clicking on any of the black point-to-point message lines in the 3D FFT timeline highlights the specified message and initiates a pop-up box with more details on this message-passing instance. Shown in Figure 8 are the details of the message highlighted at the bottom right corner of the timeline in Figure 6.

Figure 8. Individual Message Statistics
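The three-partner pattern above can be mimicked with a toy calculation. Assuming, purely for illustration, a 2 x 2 x 2 process grid whose rank bits encode the x, y and z coordinates, the partner of each rank in each transpose direction is found by flipping one coordinate bit, giving exactly three distinct partners per rank. The fft_partner helper below is hypothetical and is not part of DL_POLY's decomposition code.

```c
/* Toy model of the pairwise exchange pattern seen above.  On a
   2 x 2 x 2 process grid with bits 0, 1 and 2 of the rank giving the
   x, y and z coordinates, the transpose in direction dim pairs a rank
   with the rank whose coordinate differs in that one dimension only. */
int fft_partner(int rank, int dim)
{
    return rank ^ (1 << dim);   /* flip the bit for dimension dim */
}
```

For rank 0 the partners are ranks 1, 2 and 4, one for each of the x, y and z exchanges, matching the observation that each processor exchanges data with exactly three others.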
4.5 Profiling the NEMO application on large process counts using Vampir on HPCx

An immediate drawback of VampirTrace when using large numbers of processes is the size of the trace files produced and, consequently, the amount of memory (and time) needed by Vampir when loading them. This may be alleviated by reducing the length of the benchmarking run itself (e.g. the number of timesteps that are requested), but ultimately it may be necessary to manually instrument the source code (as described in Section 4.2.2) such that data is only collected about the sections of the code that are of interest. For instance, the scaling performance of a code will not be affected by the performance of any start-up and initialisation routines and yet, for a small benchmarking run, these may take a significant fraction of the runtime. Below we show an example of a summary activity timeline generated by Vampir using a trace file from a manually-instrumented version of the NEMO source code. The large blue areas signify time when the code was not in any of the instrumented regions, broken only by some initialisation and a region where tracing was (programmatically) switched on for a few timesteps midway through the run before being switched off again.

Figure 9. The activity timeline generated from a manually-instrumented version of NEMO. It contains a little initialisation and then data for a few timesteps midway through the run.
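The programmatic switching just described can be sketched as follows. The run_timesteps routine and the TRACE_ON/TRACE_OFF stand-ins are hypothetical; in the real NEMO instrumentation the switching is done with VampirTrace's own API calls, and the stand-ins here only record the on/off state so the control flow can be shown self-contained.

```c
/* Self-contained sketch of switching tracing on for a few mid-run
   timesteps.  The stand-ins below replace the real VampirTrace
   switching calls so the control flow can be demonstrated alone. */
static int vt_tracing = 0;
#define TRACE_ON()  (vt_tracing = 1)   /* stand-in for the VT call */
#define TRACE_OFF() (vt_tracing = 0)   /* stand-in for the VT call */

/* Run nsteps timesteps, collecting trace data only for the `count`
   timesteps starting at step `first`. */
int run_timesteps(int nsteps, int first, int count)
{
    int traced = 0;
    for (int step = 0; step < nsteps; step++) {
        if (step == first)         TRACE_ON();
        if (step == first + count) TRACE_OFF();
        /* ... the timestep computation itself would go here ... */
        if (vt_tracing)
            traced++;
    }
    TRACE_OFF();                   /* ensure tracing is off at exit */
    return traced;                 /* how many timesteps were traced */
}
```

Because the traced window is fixed in the source rather than selected by hand in the GUI, summaries produced this way are directly comparable between runs.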
The full trace data for the few timesteps may be loaded by selecting the relevant region from the summary timeline. Since the tracing has been programmatically switched on for a set number of timesteps, the information provided by the resulting summary may be reliably compared between different runs, since it is not dependent on the area of the activity timeline selected by the user. Below we show an example of such a summary where the code has been manually instrumented.

Figure 10. Summary view of trace data for five timesteps of a manually instrumented version of NEMO running on 128 processes.

Once the trace data has been loaded, the user often wishes to view the global timeline, an example of which is shown below for a single timestep of NEMO. A useful summary of this view may be obtained by right-clicking and selecting Components -> Parallelism Display. This brings up the display visible at the bottom of the figure, from which it is easy to determine which sections of the timestep are dominated by e.g. MPI communications (coloured red by default in Vampir). An example here is the section of NEMO dealing with ice transport processes (coloured bright green). Also of note in this example is the dominance of global communications (coloured purple) over the last 16 processes. It turns out that these processes have been allocated the region of the globe in the vicinity of the poles and thus have extra work to do in removing noise introduced by the small mesh size in this region.
Figure 11. A global timeline for each of 64 processors during a single timestep of NEMO. A 'parallelism' display is included at the bottom, showing the percentage of the processors involved in each activity at any one time.

The usefulness of the global timeline can be limited when looking at tracefiles for numbers of processors greater than 64, as Vampir will try to scale the data for each process so as to fit them all on screen. However, one can specify the number of process timelines displayed at a time by right-clicking on the display and selecting Options -> Show Subset... This brings up the Show Subset Dialog:

Figure 12. The Show Subset Dialog for the global timeline. Use this to choose the number of processors ('Bars') for which data is shown on the timeline.
Using this dialog one can look at the application's behaviour in detail on a few processes or look at the overall behaviour on many processes. The figure below shows Vampir displaying the activity of the majority of 128 processes during the section of the code dealing with ice rheology in NEMO. The effect of the LPARs on HPCx (effectively 16-way SMP nodes) on inter-process communication is highlighted by the fact that groups of 16 well-synchronised processes may be identified.

Figure 13. The global timeline configured to show data for the majority of the 128 processors of the job.

4.6 Identifying Load Imbalances in the Development of PDSYEVR

Profiling early versions of the new ScaLAPACK routine PDSYEVR with VampirTrace (VT) allows us to investigate its performance in detail. Basic timing analysis of the code revealed that load-balancing problems may exist for certain datasets in the eigenvector calculation stage of the underlying tridiagonal eigensolver MRRR. The Vampir analyses shown below enabled us to track this potential inefficiency with great precision.
In order to track the code in more detail, different colours were assigned here to different functions, using the syntax described in $VT_HOME/info/GROUPS.SPEC. Some additions to the underlying source code are required and a re-compilation must be undertaken. In the timeline view shown in Figure 14 the cyan areas represent computation in the subroutine DLARRV, which is involved in the calculation of eigenvectors. As usual, time spent in communication is represented by the red areas in the timeline, and the purple lines represent individual messages passed between processors.

Figure 14. Vampir Timeline for original DLARRV subroutine

The above timeline trace shows that, when calculating half the subset of eigenvalues, the workload in DLARRV increases substantially from process 0 to process 14. This causes a large communication overhead, represented by the large red areas in the trace. It was subsequently determined that the load imbalance was primarily caused by an unequal division of eigenvectors amongst the processes. These problems were addressed by the ScaLAPACK developers, and a newer version of the code gives a much better division of workload, as can be seen in the timeline traces in Figure 15.
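The essence of the fix can be illustrated with a toy distribution function. The vectors_on_process helper below is hypothetical, not the ScaLAPACK/DLARRV code; it simply shows what an even block division of nvec eigenvector computations across nproc processes looks like, i.e. the kind of balance the modified code achieves in place of the staircase workload seen in the original trace.

```c
/* Toy illustration of an even division of eigenvector work: nvec
   eigenvectors shared among nproc processes.  Spreading the remainder
   gives every process either floor(nvec/nproc) vectors or one more,
   so no process carries substantially more work than any other. */
int vectors_on_process(int nvec, int nproc, int p)
{
    int base = nvec / nproc;
    int rem  = nvec % nproc;       /* first rem processes get one extra */
    return base + (p < rem ? 1 : 0);
}
```

With 100 eigenvectors on 15 processes, every process receives six or seven vectors, rather than a workload that grows steadily from process 0 to process 14.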
Figure 15. Vampir Timeline for modified DLARRV subroutine

5 PARAVER performance analysis on HPCx

5.1 Setting up Paraver Tracing on HPCx

Paraver uses the tool OMPItrace to generate tracefiles for OpenMP programs, MPI programs, or mixed-mode OpenMP and MPI programs. Users should note that OMPItrace currently only works with 32-bit executables on HPCx, and also that OMPItrace uses IBM's DPCL (Dynamic Probe Class Library), which requires a .rhosts file in your home directory that lists all the processor ids on HPCx. Paraver tracefiles are generated on HPCx by adding the environment variables (in e.g. ksh/bash):

export OMPITRACE_HOME=/usr/local/packages/paraver
export MPTRACE_COUNTGROUP=60

to the LoadLeveler job control script. The poe command in the LoadLeveler script is changed from e.g.:

poe ./prog

to:

$OMPITRACE_HOME/bin/ompitrace -counters -v poe.real ./prog
On HPCx, poe is in fact a wrapper to the real poe command; in order for OMPItrace to function correctly on HPCx, poe.real must be called directly.

5.2 Viewing Paraver tracefiles on HPCx

The following environment variables should be set in the user's login session:

export PARAVER_HOME=/usr/local/packages/paraver
export MPTRACE_COUNTGROUP=60

During the run, Paraver will have created a temporary trace file for each process (*.mpit and *.sim files). After the run has completed, the user must pack the individual profile files into one global output. This is undertaken by issuing the command:

$PARAVER_HOME/bin/ompi2prv *.mpit -s *.sym -o trace_prm.prv

To view the resulting tracefile use the command:

$PARAVER_HOME/bin/paraver trace_prm.prv

5.3 Analysing the LUS2 application using Paraver

Unlike Vampir, upon starting Paraver users are immediately shown the Global Timeline view. This parallelisation of LUS2 is based on OpenMP, therefore threads rather than processes are listed on the vertical axis, against time on the horizontal axis. Zooming in on a representative section of the trace shows:
Figure 16. Paraver Timeline for two cycles of $OMP PARALLEL DO

The default colours assigned represent the following activities:

Figure 17. Colour properties in Paraver
The trace in Figure 16 shows a typical slice of the timeline from LUS2, where the code is executing the $OMP PARALLEL DO construct across the matrix, as described earlier. It can be seen that relatively large swathes of blue, representing computation, are divided by thread administration tasks at the start and end of each $OMP PARALLEL DO cycle.

Figure 18. Detailed view of OMP thread scheduling in LUS2
In Figure 18, above the timeline bar of each thread is a series of green flags, each denoting a change of state in the thread. Clicking on a flag gives a detailed description, as shown in the example above. Here it can be seen that thread 16 first undergoes a global synchronisation before being scheduled to run the next cycle of the loop.

6 Summary

Profilers can be highly effective tools in the analysis of parallel programs on HPC architectures. They are particularly useful for identifying and measuring the effect of problems such as communication bottlenecks and load imbalances on the efficiency of codes. New versions of these tools also include hardware performance data, which facilitates the detailed analysis of serial processor performance within a parallel run. The Vampir and Paraver GUI-based analysis tools allow users to switch with ease from global analyses of the parallel run to very detailed analyses of specific messages, all within the one profiling session. Interoperability of VampirTrace with other profilers such as KOJAK and TAU has now been made possible by the adoption of the Open Trace Format.

Acknowledgements

The authors would like to thank Matthias Jurenz from TU Dresden, Chris Johnson from EPCC, University of Edinburgh, and Ilian Todorov and Ian Bush from STFC Daresbury Laboratory for their help in creating this report.

7 References

[1] Vampir Performance Optimization
[2] VampirTrace, ZIH, Technische Universität Dresden
[3] Paraver, The European Center for Parallelism of Barcelona
[4] The DL_POLY Simulation Package, W. Smith, STFC Daresbury Laboratory
[5] PDSYEVR: ScaLAPACK's parallel MRRR algorithm for the symmetric eigenvalue problem, D. Antonelli, C. Vömel, LAPACK Working Note 168 (2005)
[6] OMPtrace Tool User's Guide
[7] The OpenMP Application Program Interface
[8] NEMO: Nucleus for European Modelling of the Ocean
[9] KOJAK Automatic Performance Analysis Toolset, Forschungszentrum Jülich
[10] TAU Tuning and Analysis Utilities, University of Oregon
[11] The European Center for Parallelism of Barcelona
[12] Science & Technology Facilities Council
[13] A Parallel Implementation of SPME for DL_POLY 3, I. J. Bush and W. Smith, STFC Daresbury Laboratory
[14] A Parallel Eigensolver for Dense Symmetric Matrices based on Multiple Relatively Robust Representations, P. Bientinesi, I. S. Dhillon, R. A. van de Geijn, UT CS Technical Report #TR-03026 (2003)
[15]
[16] PAPI: Performance Application Programming Interface

Appendix A

The list of available PAPI hardware counters on HPCx, as reported by the PAPI test case avail.c.

Available events and hardware information
Vendor string and code : IBM (-1)
Model string and code  : POWER5 (8192)
CPU Revision           :
CPU Megahertz          :
CPU's in this Node     : 16
Nodes in this System   : 1
Total CPU's            : 16
Number Hardware Counters : 6
Max Multiplex Counters :

Name          Code        Avail  Deriv  Description
PAPI_L1_DCM   0x80000000  Yes    Yes    Level 1 data cache misses
PAPI_L1_ICM   0x80000001  No     No     Level 1 instruction cache misses
PAPI_L2_DCM   0x80000002  Yes    No     Level 2 data cache misses
PAPI_L2_ICM   0x80000003  Yes    No     Level 2 instruction cache misses
PAPI_L3_DCM   0x80000004  Yes    Yes    Level 3 data cache misses
PAPI_L3_ICM   0x80000005  Yes    Yes    Level 3 instruction cache misses
PAPI_L1_TCM   0x80000006  No     No     Level 1 cache misses
PAPI_L2_TCM   0x80000007  No     No     Level 2 cache misses
PAPI_L3_TCM   0x80000008  No     No     Level 3 cache misses
PAPI_CA_SNP   0x80000009  No     No     Requests for a snoop
PAPI_CA_SHR   0x8000000a  No     No     Requests for exclusive access to shared cache line
PAPI_CA_CLN   0x8000000b  No     No     Requests for exclusive access to clean cache line
PAPI_CA_INV   0x8000000c  No     No     Requests for cache line invalidation
PAPI_CA_ITV   0x8000000d  No     No     Requests for cache line intervention
PAPI_L3_LDM   0x8000000e  Yes    Yes    Level 3 load misses
PAPI_L3_STM   0x8000000f  No     No     Level 3 store misses
PAPI_BRU_IDL  0x80000010  No     No     Cycles branch units are idle
PAPI_FXU_IDL  0x80000011  Yes    No     Cycles integer units are idle
PAPI_FPU_IDL  0x80000012  No     No     Cycles floating point units are idle
PAPI_LSU_IDL  0x80000013  No     No     Cycles load/store units are idle
PAPI_TLB_DM   0x80000014  Yes    No     Data translation lookaside buffer misses
PAPI_TLB_IM   0x80000015  Yes    No     Instruction translation lookaside buffer misses
PAPI_TLB_TL   0x80000016  Yes    Yes    Total translation lookaside buffer misses
PAPI_L1_LDM   0x80000017  Yes    No     Level 1 load misses
PAPI_L1_STM   0x80000018  Yes    No     Level 1 store misses
PAPI_L2_LDM   0x80000019  Yes    No     Level 2 load misses
PAPI_L2_STM   0x8000001a  No     No     Level 2 store misses
PAPI_BTAC_M   0x8000001b  No     No     Branch target address cache misses
PAPI_PRF_DM   0x8000001c  No     No     Data prefetch cache misses
PAPI_L3_DCH   0x8000001d  No     No     Level 3 data cache hits
PAPI_TLB_SD   0x8000001e  No     No     Translation lookaside buffer shootdowns
PAPI_CSR_FAL  0x8000001f  No     No     Failed store conditional instructions
PAPI_CSR_SUC  0x80000020  No     No     Successful store conditional instructions
PAPI_CSR_TOT  0x80000021  No     No     Total store conditional instructions
PAPI_MEM_SCY  0x80000022  No     No     Cycles Stalled Waiting for memory accesses
PAPI_MEM_RCY  0x80000023  No     No     Cycles Stalled Waiting for memory Reads
PAPI_MEM_WCY  0x80000024  No     No     Cycles Stalled Waiting for memory writes
PAPI_STL_ICY  0x80000025  Yes    No     Cycles with no instruction issue
PAPI_FUL_ICY  0x80000026  No     No     Cycles with maximum instruction issue
PAPI_STL_CCY  0x80000027  No     No     Cycles with no instructions completed
PAPI_FUL_CCY  0x80000028  No     No     Cycles with maximum instructions completed
PAPI_HW_INT   0x80000029  Yes    No     Hardware interrupts
PAPI_BR_UCN   0x8000002a  No     No     Unconditional branch instructions
PAPI_BR_CN    0x8000002b  No     No     Conditional branch instructions
PAPI_BR_TKN   0x8000002c  No     No     Conditional branch instructions taken
PAPI_BR_NTK   0x8000002d  No     No     Conditional branch instructions not taken
PAPI_BR_MSP   0x8000002e  Yes    Yes    Conditional branch instructions mispredicted
PAPI_BR_PRC   0x8000002f  No     No     Conditional branch instructions correctly predicted
PAPI_FMA_INS  0x80000030  Yes    No     FMA instructions completed
PAPI_TOT_IIS  0x80000031  Yes    No     Instructions issued
PAPI_TOT_INS  0x80000032  Yes    No     Instructions completed
PAPI_INT_INS  0x80000033  Yes    No     Integer instructions
PAPI_FP_INS   0x80000034  Yes    No     Floating point instructions
PAPI_LD_INS   0x80000035  Yes    No     Load instructions
PAPI_SR_INS   0x80000036  Yes    No     Store instructions
PAPI_BR_INS   0x80000037  Yes    No     Branch instructions
PAPI_VEC_INS  0x80000038  No     No     Vector/SIMD instructions
PAPI_RES_STL  0x80000039  No     No     Cycles stalled on any resource
PAPI_FP_STAL  0x8000003a  No     No     Cycles the FP unit(s) are stalled
PAPI_TOT_CYC  0x8000003b  Yes    No     Total cycles
PAPI_LST_INS  0x8000003c  Yes    Yes    Load/store instructions completed
PAPI_SYC_INS  0x8000003d  No     No     Synchronization instructions completed
PAPI_L1_DCH   0x8000003e  No     No     Level 1 data cache hits
PAPI_L2_DCH   0x8000003f  No     No     Level 2 data cache hits
PAPI_L1_DCA   0x80000040  Yes    Yes    Level 1 data cache accesses
PAPI_L2_DCA   0x80000041  No     No     Level 2 data cache accesses
PAPI_L3_DCA   0x80000042  No     No     Level 3 data cache accesses
PAPI_L1_DCR   0x80000043  Yes    No     Level 1 data cache reads
PAPI_L2_DCR   0x80000044  No     No     Level 2 data cache reads
PAPI_L3_DCR   0x80000045  Yes    No     Level 3 data cache reads
PAPI_L1_DCW   0x80000046  Yes    No     Level 1 data cache writes
PAPI_L2_DCW   0x80000047  No     No     Level 2 data cache writes
PAPI_L3_DCW   0x80000048  No     No     Level 3 data cache writes
PAPI_L1_ICH   0x80000049  Yes    No     Level 1 instruction cache hits
PAPI_L2_ICH   0x8000004a  No     No     Level 2 instruction cache hits
PAPI_L3_ICH   0x8000004b  No     No     Level 3 instruction cache hits
PAPI_L1_ICA   0x8000004c  No     No     Level 1 instruction cache accesses
PAPI_L2_ICA   0x8000004d  No     No     Level 2 instruction cache accesses
PAPI_L3_ICA   0x8000004e  Yes    No     Level 3 instruction cache accesses
PAPI_L1_ICR   0x8000004f  No     No     Level 1 instruction cache reads
PAPI_L2_ICR   0x80000050  No     No     Level 2 instruction cache reads
PAPI_L3_ICR   0x80000051  No     No     Level 3 instruction cache reads
PAPI_L1_ICW   0x80000052  No     No     Level 1 instruction cache writes
PAPI_L2_ICW   0x80000053  No     No     Level 2 instruction cache writes
PAPI_L3_ICW   0x80000054  No     No     Level 3 instruction cache writes
PAPI_L1_TCH   0x80000055  No     No     Level 1 total cache hits
PAPI_L2_TCH   0x80000056  No     No     Level 2 total cache hits
PAPI_L3_TCH   0x80000057  No     No     Level 3 total cache hits
PAPI_L1_TCA   0x80000058  No     No     Level 1 total cache accesses
PAPI_L2_TCA   0x80000059  No     No     Level 2 total cache accesses
PAPI_L3_TCA   0x8000005a  No     No     Level 3 total cache accesses
PAPI_L1_TCR   0x8000005b  No     No     Level 1 total cache reads
PAPI_L2_TCR   0x8000005c  No     No     Level 2 total cache reads
PAPI_L3_TCR   0x8000005d  No     No     Level 3 total cache reads
PAPI_L1_TCW   0x8000005e  No     No     Level 1 total cache writes
PAPI_L2_TCW   0x8000005f  No     No     Level 2 total cache writes
PAPI_L3_TCW   0x80000060  No     No     Level 3 total cache writes
PAPI_FML_INS  0x80000061  No     No     Floating point multiply instructions
PAPI_FAD_INS  0x80000062  No     No     Floating point add instructions
PAPI_FDV_INS  0x80000063  Yes    No     Floating point divide instructions
PAPI_FSQ_INS  0x80000064  Yes    No     Floating point square root instructions
PAPI_FNV_INS  0x80000065  No     No     Floating point inverse instructions
PAPI_FP_OPS   0x80000066  Yes    Yes    Floating point operations

avail.c PASSED
Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed
More informationImproving Applica/on Performance Using the TAU Performance System
Improving Applica/on Performance Using the TAU Performance System Sameer Shende, John C. Linford {sameer, jlinford}@paratools.com ParaTools, Inc and University of Oregon. April 4-5, 2013, CG1, NCAR, UCAR
More informationISC 09 Poster Abstract : I/O Performance Analysis for the Petascale Simulation Code FLASH
ISC 09 Poster Abstract : I/O Performance Analysis for the Petascale Simulation Code FLASH Heike Jagode, Shirley Moore, Dan Terpstra, Jack Dongarra The University of Tennessee, USA [jagode shirley terpstra
More informationBeyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy
EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery
More informationChallenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery
Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured
More information