Introduction to the Intel tracing tools

R. Bader (LRZ), M. Deilmann (Intel)

Help - my parallel program doesn't scale!

The aim is to isolate performance issues in parallel programs, especially those in large MPI programs:
  inefficient MPI programming
  bottlenecks
  latency-dominated performance
  load imbalance
  deadlock (hmmm ...)
but also issues that show up via subroutine time consumption.

Existing facilities - gprof for subroutines, the PMPI profiling interface (part of the MPI standard!) plus upshot - have problems with large parallel programs: the trace files cannot be easily analyzed.

Intel Tracing Tools

The Intel tracing tools solve the above problem by integrating subroutine and MPI profiling. They were originally developed by the Zentralinstitut für Mathematik at FZ Jülich, then marketed by Pallas, and are now an Intel product.

Platforms: x86 (Solaris & Linux); IPF and EM64T (Linux). Future development on non-Intel platforms is unclear - see also Vampir NG (FZ Jülich, ZIH Dresden). Supercomputers: Altix 3700 and 4700, NEC SX-4/5/6/8, CRAY.

Versions installed: at LRZ (campus license) GUI version 7.1 and tracing library version 7.1; at RRZE the tracing library is not available - trace libraries are rather expensive.

Documentation and resources

Intel web site - note the name mapping: VAMPIR is now the Intel Trace Analyzer, VAMPIRtrace is now the Intel Trace Collector.

LRZ web site - see especially the links to the user's guides at the end of this document, as well as the specific usage instructions for each platform the tracing libraries are available on.

Basic usage of tracing libraries and GUI

Two Components of ITT

1. Instrumentation
MPI library calls are diverted to the profiling interface. This is achieved by specifying the -vtrace switch on the mpif90/mpicc commands - an LRZ-specific setting that makes things work uniformly across platforms (with Intel MPI, use -t=log instead). In most cases relinking is sufficient. Additional functionality - subroutine tracing and switching tracing on and off - requires source code changes and recompilation.

2. Visualization
During/after the program run a trace file in STF (Structured Trace File) format is written to disk; this is still a lot of data. It may be visualized on any platform using the ITA GUI: many views of the trace data are available via a transparent menu structure, and hundreds of MPI processes are viewable with good performance.

Step 1: Instrument your Code

Example: the MPI heat conduction example. It uses non-blocking sends and receives as well as reduction operations, and sets up MPI data types and a topology (MPI_CART_CREATE); a minimal sketch of such a program is shown below.

Set up the environment:
  module load mpi_tracing
Recompile the code (for inclusion of subroutine tracing):
  mpif90 -vtrace -c <lots_of_options> <source_file>
Relink the binary (heat_mpi.exe is the name of our application):
  mpif90 -vtrace -o heat_mpi.exe <lots_of_objects>

Warning: do not specify MPI libraries at your own discretion - this ruins the link sequence set up by -vtrace!
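The heat conduction code itself is not reproduced here. The following is only a sketch (an assumption, not the LRZ example) of a program using the same MPI features - a Cartesian topology, non-blocking halo exchange, MPI_WAITALL and a reduction - so you can see what ends up in the trace after relinking with -vtrace:

  ! heat_sketch.f90 - illustrative only; compile and link with mpif90 -vtrace
  program heat_sketch
    use mpi
    implicit none
    integer, parameter :: n = 1000                 ! local grid points per process
    integer :: ierr, rank, nprocs, comm_cart, left, right, iter, i
    integer :: requests(4), statuses(MPI_STATUS_SIZE, 4)
    double precision :: u(0:n+1), unew(1:n), diff, gdiff

    call MPI_INIT(ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
    ! 1-D periodic process topology (MPI_CART_CREATE as in the example)
    call MPI_CART_CREATE(MPI_COMM_WORLD, 1, (/ nprocs /), (/ .true. /), &
                         .true., comm_cart, ierr)
    call MPI_COMM_RANK(comm_cart, rank, ierr)
    call MPI_CART_SHIFT(comm_cart, 0, 1, left, right, ierr)

    u = 0.0d0
    u(1:n) = dble(rank)                            ! arbitrary initial data
    do iter = 1, 100
       ! halo exchange with non-blocking sends and receives
       call MPI_IRECV(u(0),   1, MPI_DOUBLE_PRECISION, left,  1, comm_cart, requests(1), ierr)
       call MPI_IRECV(u(n+1), 1, MPI_DOUBLE_PRECISION, right, 2, comm_cart, requests(2), ierr)
       call MPI_ISEND(u(n),   1, MPI_DOUBLE_PRECISION, right, 1, comm_cart, requests(3), ierr)
       call MPI_ISEND(u(1),   1, MPI_DOUBLE_PRECISION, left,  2, comm_cart, requests(4), ierr)
       call MPI_WAITALL(4, requests, statuses, ierr)
       ! Jacobi-style update and local change
       diff = 0.0d0
       do i = 1, n
          unew(i) = 0.5d0 * (u(i-1) + u(i+1))
          diff = max(diff, abs(unew(i) - u(i)))
       end do
       u(1:n) = unew
       ! global convergence measure via a reduction
       call MPI_ALLREDUCE(diff, gdiff, 1, MPI_DOUBLE_PRECISION, MPI_MAX, comm_cart, ierr)
    end do
    if (rank == 0) print *, 'final residual: ', gdiff
    call MPI_FINALIZE(ierr)
  end program heat_sketch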

Step 2: Prepare the Configuration File

Edit a file vt_heat to contain the following lines:

  # Log file
  LOGFILE-NAME heat.stf
  LOGFILE-FORMAT STF
  # disable all MPI activity
  ACTIVITY MPI OFF
  # enable the waits, sends, receives and collectives used by the example
  SYMBOL MPI_WAITALL ON
  SYMBOL MPI_IRECV ON
  SYMBOL MPI_ISEND ON
  SYMBOL MPI_BARRIER ON
  SYMBOL MPI_ALLREDUCE ON
  # enable all activities in the Application class
  ACTIVITY Application ON

Step 3: Run the Application

Set up the environment (if not yet done):
  module load mpi_tracing
  export VT_CONFIG=vt_heat
Run the executable:
  mpirun -np 4 ./heat_mpi.exe

Upon completion you should get the message
  [0] Intel Trace Collector INFO: Writing tracefile heat.stf in /home/cluster/a2832ba/kurse/parprog_2005/vt/heat
and a number of files heat.stf* should exist. To save disk space, convert to the STFSINGLE format:
  stftool heat.stf --convert - --logfile-format STFSINGLE | gzip -c > heat.stfsingle.gz
Do not use this format for viewing!

Step 4: Start the GUI

X11R5 or higher is needed - is X authentication working (use ssh -X ...)? Now enter
  traceanalyzer pmatmul.stf
The Qt-based Trace Analyzer GUI should now start up, showing the following widgets: the main window and, inside the main window, a panel referring to the presently analyzed trace file. There can be more than one panel/trace file in the main window at any time.

And this is what things look like at start: the chart shown is the Function Profile, in its Flat Profile tab.

Overall load balance: the times shown are inclusive times of the traced subroutine calls.

Resolving the MPI calls: right-click and select Ungroup Group MPI. You can sort by category; times are shown for each MPI routine specified in the configuration file.

Call tree. Notes: the call tree can only resolve activities that were switched on for tracing. By default, user code is not resolvable unless automatic or manual subroutine tracing is compiled in (automatic subroutine tracing is not yet available on Intel).

Call tree split up into user processes: select Children of Group All_Processes. You can also select expansion into various groups, at least two of which are always defined: MPI and Application (click on the dotted circle).

... call up the function group editor. Select, e.g., one of the major function groups and press OK; you will then essentially filter out all other groups from the view ...

... like so:

Now a different view: the timeline of all MPI processes. Select Event Timeline from the Charts menu; this opens an additional pane in the tracing subwindow. Then zoom in to some region of interest using the left mouse button, and do this repeatedly to obtain ...

... this small, second trace section. Red parts (in different shades) are the various MPI activities; blue parts are user code; black lines are communication - left-click for the context menu and choose Message Properties to obtain further information about a message. Note that the profile window (below) adjusts itself to the selected time slice!

Next chart option: the Qualitative Timeline. This gives co-synchronous information for, e.g., transfer rates; other quantities (transfer duration, transfer volume) are selectable via the context menu. The selectable events can be function events, messages (as here), or collective operations.

Quantitative Timeline: an accumulated measure of activities. This answers the question: how many CPUs are presently engaged in each activity? Example: the yellow arrow shows an interval where the application does communication exclusively. Note: you can remove activities via the context menu if the display is too cluttered.

The final chart type: the Message Profile. It gives you metrics for message volume, throughput, time and count, with senders (horizontal) vs. receivers (vertical). Note: the color codes enable you to easily find sore points; non-dense communication patterns are good.

Some advanced features: the Vampirtrace API, controlling tracefile output, and MPI message checking.

User-Level Instrumentation: The ITC API (1)

A subroutine library lets you control the profiling process from within the program, define your own activity classes and their members, and define performance counters (not discussed here).

Include files: VT.inc for Fortran, VT.h for C. The Fortran calls differ from the C calls; this presentation refers to the Fortran calls - see the user's guide for further information. The VT API has changed over time, so existing instrumentation may need updating for newer releases.

User-Level Instrumentation: The ITC API (2)

Initialize / finalize:
VTINIT(ierr) is automatically called from MPI_INIT(...); for tracing of non-MPI programs an explicit call is required (use libvtsp.a if available). VTFINI(ierr) is called from MPI_FINALIZE(...). All handle and error arguments are default integers.

Control:
VTTRACEON() and VTTRACEOFF() switch tracing on and off; VTFLUSH(ierr) writes the memory buffers to the flush file.

User-defined states:
Group subroutine calls into a class of activities ...
  call VTCLASSDEF('mylib', mylib_handle, ierr)
... and then add symbols for each state (subroutine?):
  call VTFUNCDEF('mystate1', mylib_handle, mystate1_handle, ierr)
(etc.) ... until the group is complete.

User-Level Instrumentation: The ITC API (3)

Actual measurement: start with
  call VTBEGIN(mystate1_handle, ierr)
and stop with
  call VTEND(mystate1_handle, ierr)
Nested calls are possible ... but no overlaps! Note that name_handle must be a global variable.

Skeleton of an instrumented program:

  program prog
    include 'VT.inc'
    ...
    [ call vttraceoff() ]
    call mpi_init(...)
    call vtclassdef('mylib', mylib_handle, ierr)
    call vtfuncdef('name', mylib_handle, name_handle, ierr)
    [ call vttraceon(ierr) ]
    ...

  subroutine name(...)
    include 'VT.inc'
    (declarations)
    call vtbegin(name_handle, ierr)
    (executable statements)
    call vtend(name_handle, ierr)
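For convenience, the fragments above can be combined into one small self-contained program. This is only a sketch under the assumption that VT.inc and the tracing library are available; the class and state names ('mylib', 'solve') are freely chosen examples, not part of the course material:

  module vt_handles                      ! keeps the handles global, as required
    implicit none
    integer :: mylib_handle, solve_handle
  end module vt_handles

  program vt_api_sketch
    use mpi
    use vt_handles
    implicit none
    include 'VT.inc'
    integer :: ierr
    call MPI_INIT(ierr)                                        ! implicitly calls VTINIT
    call VTCLASSDEF('mylib', mylib_handle, ierr)               ! define activity class
    call VTFUNCDEF('solve', mylib_handle, solve_handle, ierr)  ! define state within it
    call solve()
    call MPI_FINALIZE(ierr)                                    ! implicitly calls VTFINI
  end program vt_api_sketch

  subroutine solve()
    use vt_handles
    implicit none
    include 'VT.inc'
    integer :: ierr
    call VTBEGIN(solve_handle, ierr)     ! enter the traced state
    ! ... the work to be measured goes here ...
    call VTEND(solve_handle, ierr)       ! leave the traced state
  end subroutine solve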

Controlling tracefile output (1)

Environment variables:
  VT_CONFIG: name of the configuration file
  VT_CONFIG_RANK: rank of the MPI process reading the configuration file and writing the trace file (default is MPI process 0)
Always specify the name and format of the log file in the configuration file! The rank of the process writing the trace file can also be changed via a LOGFILE-RANK <number> entry in the configuration file.

Tracefile production: trace data are stored in memory buffers, controlled by the configuration entries
  MEM-BLOCKSIZE: size of the buffers (default 64 kByte)
  MEM-MAXBLOCKS: maximum number of memory buffers (default 0 = unlimited)
VT should not use too much memory (it may disrupt the application!), yet exhaustion of the memory buffers can happen after some time. What to do? And what about applications which hang or crash?

Controlling tracefile output (2)

Coping with buffer overrun:
a) Flush data to disk: the default action (AUTOFLUSH is on); possibly use a suitable MEM-FLUSHBLOCKS value to trigger background flushing.
b) Overwrite from the beginning: only the last part of the trace is written (AUTOFLUSH off, MEM-OVERWRITE on).
c) Stop trace collection: only the first part of the trace is written (AUTOFLUSH off, MEM-OVERWRITE off - the default).

Coping with tracefile size: long runs may produce tens (or hundreds) of GBytes! STF reduces the difficulties with visualizing this (frame definition, not discussed here). To reduce the size:
  insert VTtraceon() and VTtraceoff() calls,
  use activity/symbol filtering via the configuration file (e.g., no administrative MPI calls),
  use option b) or c) above,
  or trace only a subset of processes (PROCESS 0:N off, PROCESS 0:N:3 on).
A configuration sketch combining some of these settings is shown below.
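As an illustration, a VT_CONFIG file combining several of the settings from this and the previous slide might look as follows; the numeric values are only examples and would have to be adapted:

  # log file name/format and the rank that writes it
  LOGFILE-NAME heat.stf
  LOGFILE-FORMAT STF
  LOGFILE-RANK 0
  # buffer sizing (MEM-BLOCKSIZE default is 64 kByte, MEM-MAXBLOCKS default 0 = unlimited)
  MEM-BLOCKSIZE 262144
  MEM-MAXBLOCKS 256
  # variant c): neither flush nor overwrite - keep only the first part of the trace
  AUTOFLUSH off
  MEM-OVERWRITE off
  # trace only every third process
  PROCESS 0:N off
  PROCESS 0:N:3 on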

Controlling tracefile output (3)

Treatment of crashing applications: in this case, replace the -vtrace switch by -vtrace_fs ("failsafe tracing", a replacement of the tracing library), presently available only for Fortran on the Altix.

Failures handled: signals (SIGINT, SIGTERM; SIGKILL is not caught), premature exit of processes without MPI_Finalize, and MPI errors (communication problems, wrong parameters).

What is done? The MPI processes are frozen, the writing is done via TCP sockets, and SIGINT is sent after the data have been written.

Controlling tracefile output (4)

Treatment of hanging applications: deadlock detection is performed automatically. If ITC observes no progress in any process for a certain amount of time, it assumes a deadlock, stops the application and writes a trace file. The timeout is configurable via DEADLOCK-TIMEOUT (default 5 minutes); "no progress" is defined as remaining inside the same MPI call. Obviously this is just a heuristic approach and may fail: if all processes remain in MPI for a long time, e.g. due to a long data transfer, the timeout might be reached and lead to a premature abort.

Further functionality

Recording source locations enables you to dive into the source from a message line or function in the GUI; this has a potentially very large performance overhead and is presently available only for GCC-based MPI.

Automatic subroutine tracing: the Application class can then be ungrouped and no manual instrumentation is needed. Use the -tcollect switch in addition to -vtrace; the Intel compiler 10.0 or higher is needed (see the example compile lines below).
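A sketch of the corresponding compile and link lines, extending those from Step 1 (the placeholders are the same as before):

  mpif90 -vtrace -tcollect -c <lots_of_options> <source_file>
  mpif90 -vtrace -tcollect -o heat_mpi.exe <lots_of_objects>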

MPI message checking

Error detection for MPI code, which also makes use of the MPI profiling interface. Supported module stack on LRZ systems:
  module unload mpi_tracing
  module unload mpi.parastation mpi.altix   # may need to unload further modules
  module load mpi.intel
  module load mpi_tracing
Completely recompile the application; static linkage is not supported. Run the executable with LD_PRELOAD set:
  mpiexec -genv LD_PRELOAD libvtmc.so \
          -n [No. of tasks] ./myprog.exe
The report is written to stderr; check the lines marked ERROR or WARNING.

Further environment variables for execution control (use -genv to propagate them):
  VT_DEADLOCK_TIMEOUT    default 60 sec
  VT_DEADLOCK_WARNING    default 300 s
  VT_CHECK_MAX_ERRORS    default 1
  VT_CHECK_MAX_REPORTS   default 0 (unlimited)

Message checking: example output for insufficient buffering

  [0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
  (... many more info lines ...)
  [0] ERROR: LOCAL:BUFFER:INSUFFICIENT_BUFFER: error
  [0] ERROR: Buffer [0x , 0x e[ of size 110 cannot store message of size 111.
  [0] ERROR: Free space [0x , 0x e[, 110 bytes.
  [0] ERROR: Check buffer handling (use larger buffer in MPI_Buffer_attach(),
  [0] ERROR: receive the oldest message(s) to free up space before buffering new ones,
  [0] ERROR: check for race conditions between buffering and receiving messages, etc).
  [0] ERROR: Note that message sizes are calculated using the worst-case scenario that
  [0] ERROR: the application has to be prepared for: MPI_Pack_size() + MPI_BSEND_OVERHEAD.
  [0] ERROR: New message of size 111 was to be sent by:
  [0] ERROR: MPI_Bsend(*buf=0x , count=16, datatype=mpi_char, dest=1, tag=100, comm=mpi_comm_world)
  [0] ERROR: testpairs (/home/cluster/a2832ba/size.c:85)
  [0] ERROR: wrongbuffer (/home/cluster/a2832ba/size.c:358)
  [0] ERROR: main (/home/cluster/a2832ba/size.c:378)
  [0] ERROR: libc_start_main (/lib/libc-2.4.so)
  [0] ERROR: _start (/home/cluster/a2832ba/size)
  [0] INFO: 1 error, limit CHECK-MAX-ERRORS reached => aborting

The checker points out the source lines and the faulty MPI call, and provides recommendations on fixing the problem.
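To illustrate the kind of mistake behind such a report, here is a small Fortran sketch (an assumption - the program in the report above is a different, C code): the attached buffer omits MPI_BSEND_OVERHEAD, so the buffered send does not fit and the checker flags LOCAL:BUFFER:INSUFFICIENT_BUFFER:

  program bsend_bug_sketch
    use mpi
    implicit none
    integer, parameter :: msglen = 16
    character(len=msglen) :: msg
    character(len=1), allocatable :: buf(:)
    integer :: ierr, rank, nprocs, packsize
    integer :: status(MPI_STATUS_SIZE)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
    call MPI_PACK_SIZE(msglen, MPI_CHARACTER, MPI_COMM_WORLD, packsize, ierr)
    allocate(buf(packsize))              ! bug: MPI_BSEND_OVERHEAD not added
    call MPI_BUFFER_ATTACH(buf, packsize, ierr)
    if (rank == 0 .and. nprocs > 1) then
       msg = 'some payload'
       call MPI_BSEND(msg, msglen, MPI_CHARACTER, 1, 100, MPI_COMM_WORLD, ierr)
    else if (rank == 1) then
       call MPI_RECV(msg, msglen, MPI_CHARACTER, 0, 100, MPI_COMM_WORLD, status, ierr)
    end if
    call MPI_FINALIZE(ierr)
  end program bsend_bug_sketch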

MPI checking: final remarks

Further possible environment settings can be deduced from <program>.prot. For deadlock detection with a large number of MPI tasks you may need to readjust the timeouts, else large messages may provoke false deadlock reports. Documentation: check the Intel Trace Collector User's Guide. On x86, Valgrind can also be integrated for memory checking.
