VIRTUAL INSTITUTE HIGH PRODUCTIVITY SUPERCOMPUTING. BSC Tools Hands-On. Germán Llort, Judit Giménez. Barcelona Supercomputing Center


1 BSC Tools Hands-On Germán Llort, Judit Giménez Barcelona Supercomputing Center

2 Getting a trace with Extrae

3 Extrae features
Platforms: Intel, Cray, BlueGene, Intel MIC, ARM, Android, Fujitsu Sparc
Parallel programming models: MPI, OpenMP, pthreads, OmpSs, CUDA, OpenCL, Java, Python...
Performance counters: using the PAPI interface
Link to source code: callstack at MPI routines, OpenMP outlined routines, selected user functions
No need to recompile / relink!
Periodic samples
User events (Extrae API)

4 Extrae overheads
                                Average values   Archer
Event                           ... ns           ... ns
Event + PAPI                    ... ns           ... ns
Event + callstack (1 level)     600 ns           540 ns
Event + callstack (6 levels)    1.9 us           1.5 us

5 How does Extrae work?
Symbol substitution through LD_PRELOAD (recommended)
  Specific libraries for each combination of runtimes: MPI, OpenMP, OpenMP+MPI, ...
Dynamic instrumentation
  Based on DynInst (developed by U. Wisconsin / U. Maryland)
  Instrumentation in memory, binary rewriting
Static link (i.e., PMPI, Extrae API)
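A minimal sketch of the LD_PRELOAD mechanism outside a batch system (the install path and the mpirun launcher are assumptions; on Archer the same thing is done through trace.sh and aprun, shown in the next slides):

export EXTRAE_HOME=/path/to/extrae                    # assumed install location
export EXTRAE_CONFIG_FILE=./extrae.xml                # what to trace
export LD_PRELOAD=$EXTRAE_HOME/lib/libmpitrace.so     # intercept MPI calls in the unmodified binary
mpirun -np 4 ./my_mpi_app                             # hypothetical application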

6 Linking in Archer
Cray compilers link statically by default.
How to make it dynamic? Add the flag -dynamic, which enables tracing with the LD_PRELOAD method:
archer> [ cc | CC | ftn ] ... -dynamic
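For example, an MPI C++ code could be rebuilt dynamically as follows (source file name and optimization flags are illustrative, not from the slides):

archer> CC -g -O2 -dynamic -o lulesh lulesh.cc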

7 Problems with dynamic linking?
Link statically against the tracing library (+ dependencies).
Only supports MPI instrumentation. Insert it before the actual MPI library so that Extrae always intercepts the MPI calls. Don't set LD_PRELOAD.
LDFLAGS += \
  -L$EXTRAE_HOME/lib -lmpitrace \
  -L$BSCTOOLS_HOME/deps/binutils/2.24/lib -lbfd -liberty \
  -L$BSCTOOLS_HOME/deps/libunwind/1.1/lib -lunwind \
  -L/opt/cray/papi/ /lib -lpapi \
  -L/usr/lib64 -lxml \
  -lrt -lz -ldl
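A minimal sketch of how these flags might be passed at link time if LDFLAGS is set as a shell variable with the contents above (object and executable names are illustrative, not from the slides):

archer> cc -o lulesh lulesh.o $LDFLAGS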

8 Using Extrae in 3 steps
1. Adapt your job submission script
2. Configure what to trace
   XML configuration file; example configurations at $EXTRAE_HOME/share/example
3. Run it!
For further reference check the Extrae User Guide, also distributed with Extrae at $EXTRAE_HOME/share/doc
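For instance, a starting XML could be taken from the examples shipped with the installation (the exact subdirectory layout is an assumption; list the directory to see what your version provides):

archer> ls $EXTRAE_HOME/share/example
archer> cp $EXTRAE_HOME/share/example/MPI/ld-preload/extrae.xml .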

9 Login to Archer and copy the examples
laptop> ssh -Y <USER>@login.archer.ac.uk
archer> cp -r /work/y14/shared/bsctools/tools-material $WORK
archer> ls $WORK/tools-material
apps/  clustering/  extrae/  slides/  traces/
(slides/ contains a copy of these slides)

10 Step 1: Adapt the job script to load Extrae with LD_PRELOAD
archer> vi $WORK/tools-material/extrae/run_lulesh_27p.sh

#!/bin/bash --login
#PBS -N LULESH2                  # Request resources
#PBS -l select=2
#PBS -l walltime=00:05:00
#PBS -A y14

module unload PrgEnv-cray PrgEnv-gnu
module load PrgEnv-intel         # Change MPI version

export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)
cd ${PBS_O_WORKDIR}

export OMP_NUM_THREADS=1

aprun -n 27 -S 7 ../apps/lulesh  # Run the program

11 Step 1: Adapt the job script to load Extrae with LD_PRELOAD
archer> vi $WORK/tools-material/extrae/run_lulesh_27p.sh

#!/bin/bash --login
#PBS -N LULESH2
#PBS -l select=2
#PBS -l walltime=00:05:00
#PBS -A y14

module unload PrgEnv-cray PrgEnv-gnu
module load PrgEnv-intel

export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)
cd ${PBS_O_WORKDIR}

export OMP_NUM_THREADS=1
export TRACE_NAME=lulesh_27p.prv

aprun -n 27 -S 7 ./trace.sh ../apps/lulesh  # Activate Extrae during the run

12 Step 1: Adapt the job script to load Extrae with LD_PRELOAD: trace.sh
archer> vi $WORK/tools-material/extrae/trace.sh

#!/bin/bash
source /work/.../extrae/intel-mpich/etc/extrae.sh      # Same MPI version as the application

# Configure Extrae (select what to trace)
export EXTRAE_CONFIG_FILE=./extrae.xml

# Load the tracing library (select your type of application, C or Fortran)
export LD_PRELOAD=${EXTRAE_HOME}/lib/libmpitrace.so
#export LD_PRELOAD=${EXTRAE_HOME}/lib/libmpitracef.so

# Run the program
$*

13 Step 1: Which tracing library?
Choose depending on the application type:
  Library                  Serial   MPI   OpenMP   pthread   CUDA
  libseqtrace                X
  libmpitrace[f] (1)                 X
  libomptrace                               X
  libpttrace                                          X
  libcudatrace                                                  X
  libompitrace[f] (1)                X      X
  libptmpitrace[f] (1)               X                X
  libcudampitrace[f] (1)             X                          X
(1) include suffix "f" in Fortran codes
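For example, for a hybrid MPI+OpenMP C code the combined library from the table would be loaded in trace.sh instead of libmpitrace (a sketch following the naming in the table):

export LD_PRELOAD=${EXTRAE_HOME}/lib/libompitrace.so
# Fortran codes add the "f" suffix:
# export LD_PRELOAD=${EXTRAE_HOME}/lib/libompitracef.so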

14 Step 3: Run it!
Submit your job:
archer> cd $WORK/tools-material/extrae
archer> qsub run_lulesh_27p.sh
Check the status of your job with: qstat -u $USER
Once finished, the trace will be in the same folder: lulesh_27p.{pcf,prv,row} (3 files)
Any issue? Pre-generated traces are available at $WORK/tools-material/traces
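A quick sanity check once the job has finished (a sketch, assuming the merge step left the trace in the submission directory):

archer> ls -l lulesh_27p.prv lulesh_27p.pcf lulesh_27p.row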

15 Step 2: Extrae XML configuration
archer> vi $WORK/tools-material/extrae/extrae.xml

Trace the MPI calls (What's the program doing?):
<mpi enabled="yes">
  <counters enabled="yes" />
</mpi>
<openmp enabled="yes">
  <locks enabled="no" />
  <counters enabled="yes" />
</openmp>
<pthread enabled="no">
  <locks enabled="no" />
  <counters enabled="yes" />
</pthread>

Trace the call-stack (Where in my code?); compile with debug! (-g):
<callers enabled="yes">
  <mpi enabled="yes">1-3</mpi>
  <sampling enabled="no">1-5</sampling>
</callers>

16 Step 2: Extrae XML configuration (II)

Select which HW counters are measured (How's the machine doing?):
<counters enabled="yes">
  <cpu enabled="yes" starting-set-distribution="cyclic">
    <set enabled="yes" changeat-time="500000us" domain="all">
      PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_L1_DCM, PAPI_L3_TCM, PAPI_BR_INS, PAPI_L2_DCA
    </set>
    <set enabled="yes" changeat-time="500000us" domain="all">
      PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_SR_INS, RESOURCE_STALLS:ROB, RESOURCE_STALLS:RS
    </set>
    <set ...> ... </set>
  </cpu>
  <network enabled="no" />
  <resource-usage enabled="no" />
  <memory-usage enabled="no" />
</counters>
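To check which counters the compute nodes actually support before editing these sets, the standard PAPI utilities can be used (a sketch; whether a papi module must be loaded and where the utilities live depend on the system):

archer> module load papi
archer> papi_avail            # preset counters (PAPI_TOT_INS, PAPI_L1_DCM, ...)
archer> papi_native_avail     # native counters (e.g. RESOURCE_STALLS:ROB)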

17 Step 2: Extrae XML configuration (III)

Trace buffer size (flush/memory trade-off):
<buffer enabled="yes">
  <size enabled="yes"> </size>
  <circular enabled="no" />
</buffer>

Enable sampling (Want more details?):
<sampling enabled="no" type="default" period="50m" variability="10m" />

Automatic post-processing to generate the Paraver trace:
<merge enabled="yes" synchronization="default" tree-fan-out="16" max-memory="512"
       joint-states="yes" keep-mpits="yes" sort-addresses="yes" overwrite="yes">
  $TRACE_NAME$
</merge>
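If the automatic merge were disabled, or the intermediate files kept with keep-mpits, the Paraver trace can also be produced manually with Extrae's merger; a minimal sketch, assuming the TRACE.mpits file is in the current directory:

archer> ${EXTRAE_HOME}/bin/mpi2prv -f TRACE.mpits -o lulesh_27p.prv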

18 Installing Paraver & First analysis steps

19 Install Paraver on your laptop
Download from the BSC tools website; pick your version.
Also available at /work/y14/shared/bsctools/tools-packages:
  wxparaver win.zip
  wxparaver mac.zip
  wxparaver linux_i686.tar.gz (32-bit)
  wxparaver linux_x86_64.tar.gz (64-bit)
laptop> scp <USER>@login.archer.ac.uk:/work/y14/shared/bsctools/tools-packages/<package> $HOME

20 Install Paraver (II)
Download the tutorials: Documentation -> Tutorial guidelines
Also available at /work/y14/shared/bsctools/tools-packages:
laptop> scp <USER>@login.archer.ac.uk:/work/y14/shared/bsctools/tools-packages/paraver-tutorials.tar.gz $HOME

21 Uncompress, rename & move
Uncompress both packages, rename the folders to paraver and tutorials, and drag the tutorials folder into the paraver folder.
On macOS the destination is inside the application bundle: right click -> Show Package Contents -> Contents -> Resources
Command-line (Linux):
laptop> tar xf wxparaver linux-x86_64.tar.gz
laptop> mv wxparaver linux-x86_64 paraver
laptop> tar xf paraver-tutorials tar.gz
laptop> mv paraver-tutorials paraver/tutorials

22 Check that everything works
Start Paraver:
laptop> $HOME/paraver/bin/wxparaver &
Check that the tutorials are available: click Help -> Tutorials
Also remotely available on Archer:
laptop> ssh -Y <USER>@login.archer.ac.uk
archer> /work/y14/shared/bsctools/wxparaver/latest/bin/wxparaver

23 First steps of analysis
Copy the trace to your laptop (all 3 files: *.prv, *.pcf, *.row):
laptop> scp <USER>@login.archer.ac.uk:$WORK/tools-material/extrae/lulesh_27p.* ./
Load the trace: click File -> Load Trace and browse to the *.prv file
Follow Tutorial #3 "Introduction to Paraver and Dimemas methodology": click Help -> Tutorials

24 Measure the parallel efficiency
Click on mpi_stats.cfg
Open the Control Window (drag & drop) and zoom to skip the initialization / finalization phases
Right click -> Copy Time on the timeline, then right click -> Paste Time on the table
Read the Parallel efficiency, Communication efficiency and Load balance

25 Computation time and work distribution
Click on 2dh_usefulduration.cfg (2nd link): shows the time spent computing
Click on 2dh_useful_instructions.cfg (3rd link): shows the amount of work
Zoom to skip the large burst from the initialization (by drag-and-dropping)
A zig-zag shape in these histograms indicates work / performance imbalance

26 Where does this happen?
Go from the table to the timeline: slow and fast regions at the same time indicate imbalance
Click on Open Filtered Control Window
Select the area / zoom into 1 of the iterations (by drag-and-dropping)
Right click -> Copy, then right click -> Paste Time
Hints -> Callers -> Caller function
Right click -> Fit Semantic Scale -> Fit both
Hidden values (click to show): CommSend, CommMonoQ, TimeIncrement

27 Save CFGs (2 methods)
Method 1: right click on the timeline and save the configuration
Method 2: from the main Paraver window, select the windows to save and click Save

28 CFGs distribution
Paraver comes with many more CFGs included

29 Hints: a good place to start!
Paraver suggests CFGs based on the information present in the trace

30 Cluster-based analysis

31 Use clustering analysis
Run the clustering tool:
laptop> ssh -Y <USER>@login.archer.ac.uk
archer> cd $WORK/tools-material/clustering
archer> /work/y14/shared/bsctools/clustering/2.6.6/bin/burstclustering \
          -d cluster.xml -i ../extrae/lulesh_27p.prv -o lulesh_27p_clustered.prv
If you didn't get your own trace, use a prepared one:
archer> ls $WORK/tools-material/traces/lulesh_27p.prv
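Besides the clustered trace, burstclustering writes auxiliary data files and gnuplot scripts next to it; listing them shows what is available for the next step (a sketch; exact file names depend on the counters present in the trace):

archer> ls lulesh_27p_clustered.*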

32 Check the resulting scatter plot
archer> gnuplot lulesh_27p_clustered.ipc.papi_tot_ins.gnuplot
Identify the main computing trends: Work (Y axis) vs. Speed (X axis)
Look at the clusters' shape: variability in either axis (variable work, variable speed) indicates potential imbalances

33 Correlating scatter plot and time distribution
Copy the clustered trace to your laptop and open it:
laptop> $HOME/paraver/bin/wxparaver <path-to>/lulesh_27p_clustered.prv
Display the distribution of clusters over time: File -> Load configuration -> $HOME/paraver/cfgs/clustering/clusterID_window.cfg
Variable work / speed + different processes = imbalances
