TAU by example - Mpich

TAU (Tuning and Analysis Utilities) is a toolkit for profiling and tracing parallel programs written in C, C++, Fortran and other languages. It supports dynamic (library-based), compiler and source-level instrumentation. Unlike MPE, TAU is not limited to profiling MPI code; it is geared towards parallel programming in general, including CUDA, OpenMP and regular pthreads.

[Figure: ParaProf with QMCPACK]

As TAU is already extensively documented, this page will only provide a short introduction to some common features, along with some basic example code.

Contents

1 Installation
  1.1 Desktop Linux (Ubuntu)
2 Wave2D
  2.1 Description
  2.2 Profiling
    2.2.1 Dynamic instrumentation
    2.2.2 Source instrumentation
    2.2.3 Compiler-based instrumentation
    2.2.4 Selective instrumentation
  2.3 Visualization
    2.3.1 pprof
    2.3.2 ParaProf
  2.4 Tracing
3 Ring
4 NWChem
5 Hardware counters
6 Notes
7 External links
Installation

Desktop Linux (Ubuntu)

After downloading both TAU and PDT from the TAU downloads page (/Research/tau/downloads.php on the TAU website), unpack them wherever convenient and run the following commands to configure, compile and install TAU. If MPICH2 and PDT were installed under other prefixes, the part within brackets must be set up appropriately: point $MPI_PATH to the directory where MPICH2 was installed, and adjust $PDT_DIR accordingly. Otherwise, it can be left out. (If MPICH2 was configured with --enable-shared, it is not necessary to pass the -mpilibrary argument below.)

% ./configure -mpilibrary='-lmpich -lmpl -lopa' [-mpilib=$MPI_PATH/lib -mpiinc=$MPI_PATH/include -pdt=$PDT_DIR]
% make -j clean install

For more information, please refer to TAU's installation manual (/bk04ch01.html#installing.tau).

Wave2D

Description

The wave equation is a partial differential equation used to describe the behavior of waves as they occur in physics. Here, the variable u represents a physical measure such as pressure or water depth. This equation can be discretized over a 2D grid using a finite differencing scheme, leading to an explicit update rule.
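
The continuous equation and the resulting update rule can be written in the standard form below. This is a reconstruction: the original page displays the formulas as images, so the exact notation (u sampled at grid point (i, j) and time step n, with uniform grid spacing h) is assumed.

\[
\frac{\partial^2 u}{\partial t^2} = c^2 \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right)
\]

\[
u_{i,j}^{n+1} = 2\,u_{i,j}^{n} - u_{i,j}^{n-1}
  + \frac{c^2 \Delta t^2}{h^2} \left( u_{i+1,j}^{n} + u_{i-1,j}^{n} + u_{i,j+1}^{n} + u_{i,j-1}^{n} - 4\,u_{i,j}^{n} \right)
\]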

where u is defined over a rectangular 2D grid in the dimensions x and y and over time steps n of size Δt. The constant c defines the wave's propagation speed. We hence obtain the iterative five-point stencil code wave2d.

[Figure: solution to the wave equation]

Profiling

A list of the different instrumentation methods that can be used for profiling (and of which features each supports) can be found in the TAU user's guide (/tau/docs/usersguide/ch01.html). We will cover three different methods:

- Dynamic: statistical sampling; [1]
- Source: parser-aided automatic code categorization;
- Selective: uses a separate file to "manually control which parts of the application are profiled and how they are profiled"; this is technically part of the source instrumentation above (and hence also requires PDT).

Dynamic instrumentation

The most straightforward way of getting started with TAU is through tau_exec, which performs dynamic instrumentation. This method uses statistical sampling to estimate which percentage of the execution time is taken by each function (as well as the absolute time spent). We first compile as usual:

% mpicxx wave2d.cpp -o wave2d

The difference occurs when executing the program, which is done as follows: [2]

% mpirun -np 12 tau_exec ./wave2d

Unfortunately, this method does not support profiling user-defined routines, only those from MPI. For this reason, we do not recommend it for real-world applications, which very likely spend less time in MPI calls than in the computation itself, and that computation would go unaccounted for.

Source instrumentation

If PDT is installed, it is possible to use the compiler wrapper scripts tau_cc.sh/tau_cxx.sh to automatically instrument our code. In this case, the difference is that one call to TAU_PROFILE is inserted in every user-defined function, carrying that function's header. [3]

First, we select which features we want TAU to use (e.g. MPI support, tracing, CUDA, hardware counters, etc.), which is done by setting an environment variable called TAU_MAKEFILE to point to one of the (informatively named) default Makefiles located in <TAU_HOME>/lib/. For now, we only want to profile MPI code, so we write:

% export TAU_MAKEFILE=$TAU_HOME/lib/Makefile.tau-mpi-pdt

Next, to build the instrumented wave2d program, we replace the regular mpicc or mpicxx command with tau_cc.sh or tau_cxx.sh:

% tau_cxx.sh wave2d.cpp -o wave2d

If all goes right, we can then execute the code as usual:

% mpirun -np 12 ./wave2d
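
To illustrate, the snippet below sketches what the instrumenter effectively adds at the top of a user-defined routine, using TAU's TAU_PROFILE macro. The routine name is hypothetical and the code that tau_cxx.sh actually generates differs in detail; note [3] below explains how to inspect the real output.

#include <TAU.h>   // made available when building with tau_cxx.sh

// A timer named after the routine's header is started on entry and
// stopped automatically when the routine returns.
void do_one_iteration(double *grid, int nx, int ny)
{
    TAU_PROFILE("void do_one_iteration(double *, int, int)", " ", TAU_DEFAULT);
    // ... original body of the routine, unchanged ...
    (void)grid; (void)nx; (void)ny;
}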

Compiler-based instrumentation

Additionally, TAU has a compiler instrumentation method, which sits in between dynamic and source. Unlike dynamic instrumentation it requires recompilation, but it also inherits some features from source instrumentation, such as being able to profile user-defined functions. However, it cannot provide information about finer constructs such as loops. To use this method, the argument -tau_options=-optCompInst should be added to tau_cc.sh/tau_cxx.sh when compiling (or, equivalently, -optCompInst can be added to the TAU_OPTIONS environment variable); visualization and program execution remain exactly the same. In practice, we recommend installing PDT and using the source mode.

Selective instrumentation

A large program might have dozens of auxiliary functions that do not account for a significant chunk of the execution time and hence only visually pollute the profile. It may also happen that a function has two or more time-consuming loops which will not be individually represented in the profile. For this and other reasons we may want to selectively exclude functions, annotate (outer) loops, and so on, using TAU's support for selectively profiling applications. Consider the following example, which we will name select.tau:

BEGIN_EXCLUDE_LIST
void foo(int *, double)
void bartoo_#(int *)
END_EXCLUDE_LIST

BEGIN_INSTRUMENT_SECTION
loops file="random.cpp" routine="int FooClass::fooToo(double, double)"
END_INSTRUMENT_SECTION

Here, the symbol # in the function names acts as a wildcard. To make use of this file, we define the TAU_OPTIONS environment variable:

% export TAU_OPTIONS="-optTauSelectFile=select.tau"

Warning: this does not work as expected with regular C code. There, a C should be added after every function in the exclude list; otherwise, a # should be added after every function name. This is a consequence of the design used by TAU, and is effectively arbitrary. For some more information, check the official manual (/Research/tau/docs/newguide/bk01ch01s03.html).
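
For instance, following the warning above, an exclude list for plain C sources would look roughly like the sketch below (same hypothetical routine names as before; the trailing C marks each routine as a C function):

BEGIN_EXCLUDE_LIST
void foo(int *, double) C
void bartoo_#(int *) C
END_EXCLUDE_LIST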

Visualization

Regardless of the instrumentation method, we will then obtain a number of profile.r.* files (r being a rank). These can be visualized in a number of ways.

pprof

pprof is text-based and can be invoked simply by running pprof. Its output has roughly the shape below (the timing values are elided here):

% pprof
Reading Profile files in profile.*

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                           usec/call
---------------------------------------------------------------------------------------
  ...          ...          ...         ...         ...        ...  int main(int, char **)
  ...          ...          ...         ...         ...        ...  void Grid::doIterations(int)
  ...          ...          ...         ...         ...        ...  void Grid::doOneIteration()
  ...          ...          ...         ...         ...        ...  void Grid::exchangeEdges()
  ...          ...          ...         ...         ...        ...  MPI_Waitall()
  ...          ...          ...         ...         ...        ...  void Grid::Grid(int, int, int, int, ...)
  ...          ...          ...         ...         ...        ...  void Grid::initGrid(int)
  ...          ...          ...         ...         ...        ...  MPI_Finalize()
  ...          ...          ...         ...         ...        ...  MPI_Init()
  ...          ...          ...         ...         ...        ...  MPI_Send()
  ...          ...          ...         ...         ...        ...  void Grid::~Grid()
  ...          ...          ...         ...         ...        ...  MPI_Irecv()
  ...          ...          ...         ...         ...        ...  MPI_Bcast()
  ...          ...          ...         ...         ...        ...  MPI_Comm_rank()
  ...          ...          ...         ...         ...        ...  MPI_Comm_size()

USER EVENTS Profile :NODE 0, CONTEXT 0, THREAD 0
NumSamples  MaxValue  MinValue  MeanValue  Std. Dev.  Event Name
       ...       ...       ...        ...        ...  Message size for broadcast

... the same tables for every rank ...

FUNCTION SUMMARY (total):
(same columns and rows as above, aggregated over all ranks; values elided)

... the same table, but now with mean values ...

More information about how to sort the data differently can be found by running pprof -h.

ParaProf

ParaProf is a much richer, graphical interface for visualization and analysis; it can also be started simply by running paraprof. The vast majority of the functionality can be found by navigating the menus. For instance, a graph of functions ordered by average time taken can be obtained by right-clicking Mean and selecting Show Mean Bar Chart.

[Figure: ParaProf with an inclusive metric]

Analogously, it is possible to see the execution time of a given function on all ranks. One way of doing this is to right-click the corresponding function bar and select Show Function Bar Chart. Alternatively, the menu Windows->Function->Bar Chart displays a list of all profiled functions.

[Figure: inclusive execution time per routine on one rank]
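
If the profiles need to be archived or inspected on another machine, ParaProf can also bundle all of the profile.* files into a single packed file; this is a standard ParaProf feature, although the file name below is arbitrary:

% paraprof --pack wave2d.ppk
% paraprof wave2d.ppk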

Tracing

In addition to profiling, TAU can automatically instrument the code to do tracing without user intervention (unlike MPE, which requires manual insertion of tracing code). This depends on PDT (in particular, on its C/C++ parser), which is why we used a makefile ending in pdt in the previous section. However, tracing is disabled by default, so we enable it first:

% export TAU_TRACE=1

(Note that this disables profiling, i.e. no profile.* files will be generated.)

[Figure: Jumpshot main window]

Now the program can be executed normally (with mpirun in this case), which will generate many .edf and .trc files. These must then be merged as follows:

% tau_treemerge.pl
% tau2slog2 tau.trc tau.edf -o tau.slog2

All that is left is to visualize the tracing data (tau.slog2) with Jumpshot (or another such tool), as TAU does not have a tracing visualizer of its own. Notice in the legend window that all functions have been automatically instrumented.

[Figure: Jumpshot legend window]
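
Putting the tracing steps together, a complete session might look like the following. The jumpshot launcher assumes the Jumpshot viewer bundled with TAU/MPE is on the PATH, and the -np value is only illustrative:

% export TAU_TRACE=1
% mpirun -np 12 ./wave2d                  # writes one .trc and one .edf file per rank
% tau_treemerge.pl                        # merges them into tau.trc and tau.edf
% tau2slog2 tau.trc tau.edf -o tau.slog2
% jumpshot tau.slog2                      # open the merged trace in Jumpshot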

Ring

[Figure: ParaProf plot for time spent per function, per rank]

This toy example implements a ring whose elements (ranks) asynchronously send their successor a buffer whose size depends on their rank (as does the amount of work needed to prepare it). The image on the left is a three-dimensional variation of the original 2D profile plot (at the top of the page); it can be accessed by navigating to Windows->3D Visualization. The menu on the right offers a few more plot configurations; of note is the scatter plot option.

This dependence of the buffer size on the originating rank can also be seen in the communication matrix, which may be enabled (before run time) by setting the environment variable TAU_COMM_MATRIX:

% export TAU_COMM_MATRIX=1

[Figure: 3D communication matrix]

From this graph it should be clear that the amount of communication between a node and its successor increases approximately in proportion to the square of the originating rank (the actual power is …).
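
The original ring source is not reproduced on this page; the minimal stand-in below only mimics the communication pattern described above (rank-dependent buffer sizes, non-blocking sends to the successor), so the sizes and names are assumptions:

// ring_sketch.cpp -- minimal stand-in for the ring toy example.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int next = (rank + 1) % size;
    const int prev = (rank + size - 1) % size;

    // Each rank sends its successor a buffer whose size grows with its rank,
    // which is what the communication matrix above makes visible.
    const int count = 1024 * (rank + 1) * (rank + 1);
    std::vector<double> sendbuf(count, static_cast<double>(rank));
    std::vector<double> recvbuf(1024 * size * size);

    MPI_Request reqs[2];
    MPI_Irecv(recvbuf.data(), static_cast<int>(recvbuf.size()), MPI_DOUBLE,
              prev, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf.data(), count, MPI_DOUBLE,
              next, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}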

NWChem

Here we focus on ParaProf's scatter plot option. The idea of this viewer is to help developers identify groups of functions whose running times are tightly related. In the case of NWChem, this can be used to help develop a time model for the (new) static load balancing mechanism.

[Figure: clustering view for NWChem]

The horizontal axis of this 3D scatter plot shows the exclusive execution time of the two dominant kernels in NWChem coupled cluster simulations (DGEMM and TCE_SORT). The vertical axis shows the time spent on dynamic load balancing. We see that smaller tasks, which take less time in the kernels, take more time in dynamic load balancing. We also see that the reason this occurs is possibly the size of the input data for the kernels: the red points correspond to large operations that take more time in the GA_Accumulate() operation compared to the green points. [4]

Hardware counters

Like HPCToolkit, TAU can be built with PAPI support, which adds the ability to profile branching and cache access patterns, time stalled waiting for resources (such as memory reads), and so on. The only change required in the configuration phase above is the addition of the argument -papi=, followed by the folder where PAPI was installed (say, /opt/papi), e.g.:

% ./configure -mpilibrary='-lmpich -lmpl -lopa' -papi=/opt/papi

(The same comment above about MPICH2 and --enable-shared applies here.) This will produce a new TAU makefile, Makefile.tau-papi-mpi-pdt, which should be used in the export TAU_MAKEFILE= commands above. Up to 25 counters/events can then be recorded by exporting the environment variables COUNTER1 through COUNTER25, as follows:

% export COUNTER1=PAPI_TOT_CYC
% export COUNTER2=PAPI_FML_INS
% export COUNTER3=PAPI_FMA_INS

Compilation and execution then proceed exactly as usual. Instead of producing a single set of profile.* files, TAU will generate one folder for each counter:

% ls
MULTI__PAPI_TOT_CYC  MULTI__PAPI_FML_INS  MULTI__PAPI_FMA_INS  ...

ParaProf can then be used to visualize the recorded metrics (as on the left) under the Windows/Thread submenu. For a simple use case, see the corresponding section on HPCToolkit.

[Figure: counter statistics for HPCToolkit's matrix multiply example]
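
Putting the pieces together, a counter-based profiling run might look like the sketch below. The counter names and the -np value are only illustrative, and availability of particular PAPI presets depends on the CPU (papi_avail lists them):

% export TAU_MAKEFILE=$TAU_HOME/lib/Makefile.tau-papi-mpi-pdt
% export COUNTER1=GET_TIME_OF_DAY      # wall-clock time, commonly kept as one of the counters
% export COUNTER2=PAPI_TOT_CYC
% export COUNTER3=PAPI_L1_DCM          # L1 data-cache misses (assumed PAPI preset)
% tau_cxx.sh wave2d.cpp -o wave2d
% mpirun -np 12 ./wave2d
% paraprof &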

Notes

1. Referred to as "binary rewriting" in TAU's manual.
2. The first two numbers are the dimensions of every rank's rectangle; the next two are the number of rectangles in the x and y dimensions (their product should equal the number of ranks); the last is the number of initial perturbations in the wave (i.e. the circles in the image above).
3. To see the modifications made by the TAU script, you can add -optKeepFiles to the TAU_OPTIONS environment variable. The instrumented code will be written to PROGNAME.inst.c if PROGNAME is the executable's name, and similarly for C++.
4. Result from David Ozog, a PhD student at MCS/UOregon.

External links

- Main TAU website
- TAU Wiki
- Original wave2d example (/charm++/wave2d/)
