Performance Analysis of MPI Programs with Vampir and Vampirtrace Bernd Mohr

Performance Analysis of MPI Programs with Vampir and Vampirtrace
Bernd Mohr
Research Centre Juelich (FZJ)
John von Neumann Institute of Computing (NIC)
Central Institute for Applied Mathematics (ZAM)
52425 Juelich, Germany
b.mohr@fz-juelich.de

Performance Tuning: an Old Problem!

Contents
- Definitions
- Practical Performance Analysis
- Vampirtrace
- Vampir

Definitions: Profiling
Profiling := recording summary information (time, #calls, #FLOPS, ...) about program entities (functions, loops, basic blocks, ...)
- very good for a quick, low-cost overview
- points out potential bottlenecks / hotspots
- implemented through sampling or instrumentation

Definitions: Tracing
Tracing := recording information about significant points (events) during execution of the program
- entering or leaving a code region (function, loop, ...)
- sending or receiving a message, ...
An event record typically consists of
- timestamp
- CPU id, thread id
- event type
- plus event-specific information
Event trace := sequence of event records sorted by time
- can be used to reconstruct the dynamic behavior of the program
- typically requires code instrumentation
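To make the notion of an event record concrete, a minimal sketch in C might look like this (the field and type names are illustrative assumptions, not the record format of any particular tool):

    #include <stdint.h>

    /* One event record: timestamp, location, event type, and
       event-specific information (e.g. region id or message peer). */
    typedef enum { ENTER, EXIT, SEND, RECV } event_type;

    typedef struct {
        uint64_t   timestamp;  /* when the event occurred                  */
        int        cpu_id;     /* process / CPU that generated the event   */
        int        thread_id;  /* thread within that process               */
        event_type type;       /* kind of event                            */
        int        info;       /* region id, or partner rank for send/recv */
    } event_record;

    /* An event trace is a sequence of such records sorted by time;
       this ordering is what allows the dynamic behavior to be replayed. */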

Definitions: Instrumentation
Instrumentation := inserting extra code (hooks) into the application
Wrapper := entry and exit hook wrapped around a function (call)
Source code instrumentation: done by compiler, source-to-source translator, or manually
+ portability
+ link back to source code is easy
- re-compilation necessary for (a change in) instrumentation
- cannot instrument system or 3rd-party libraries / executables
- hard to use for mixed-language applications
- source-to-source tool hard to implement for C++ (and Fortran 90)

Definitions: Instrumentation
Object code instrumentation: re-writing the executable to insert hooks
Dynamic code instrumentation (like a debugger): object code instrumentation while the program is running
- examples: dyninst, DPCL
+/- pros and cons are the other way round compared to source instrumentation
Pre-instrumented library: typically used for MPI and PVM program analysis
+ easy to use: only re-linking necessary
- can only record information about library objects

Event Tracing: Instrumentation, Monitoring, Trace

Instrumented code on CPU A:

    void master() {
        trace(enter, 1);
        ...
        trace(send, B);
        send(B, tag, buf);
        ...
        trace(exit, 1);
    }

Instrumented code on CPU B:

    void slave() {
        trace(enter, 2);
        ...
        recv(A, tag, buf);
        trace(recv, A);
        ...
        trace(exit, 2);
    }

MONITOR (symbol table and time-sorted event records):

    1  master
    2  slave
    3  ...

    58  A  ENTER  1
    60  B  ENTER  2
    62  A  SEND   B
    64  A  EXIT   1
    68  B  RECV   A
    69  B  EXIT   2
    ...

Event Tracing: Timeline Visualization
[Figure: the event trace above (records 58-69) rendered as a timeline diagram, with one row per process (A, B), a time axis from 58 to 70, regions main / master / slave, and the A-to-B message drawn between the rows.]

Practical Performance Analysis and Tuning
- 50% of 1 Gflop/s is better than 20% of 2 Gflop/s: the sustained fraction of peak performance is crucial
- Successful tuning is a combination of
  - the right algorithm and libraries
  - compiler flags and pragmas / directives (learn and use them)
  - THINKING
- Measurement is better than reasoning / intuition (= guessing)
  - to determine performance problems
  - to validate tuning decisions / optimizations (after each step!)

Practical Performance Analysis and Tuning
- It is easier to optimize a slow correct program than to debug a fast incorrect one
  - debugging before tuning
  - nobody really cares how fast you can compute the wrong answer
- The 80/20 rule
  - a program spends 80% of its time in 20% of its code
  - the programmer spends 20% of the effort to get 80% of the total speedup possible in the code
  - know when to stop!
- Don't optimize what doesn't matter
  - make the common case fast

Typical Performance Analysis Procedure
- Do I have a performance problem at all?
  - time / hardware counter measurements
  - speedup and scalability measurements
- What is the main bottleneck (computation / communication / ...)?
  - flat profiling (sampling / prof)
- Where is the main bottleneck?
  - call-graph profiling (gprof)
  - detailed (basic block) profiling
- Where is the bottleneck exactly, and why is it there?
  - trace selected parts (to keep trace files manageable)
- Does my code have scalability problems?
  - profile the code for a typical small and a large processor count
  - compare the profiles function by function

Limiting Trace File Size
- Use the smallest number of processors possible
- Use the smallest input data set (e.g. number of timesteps)
- Limit the trace file size by environment variables / API functions / config files
- Trace only selected parts by using control functions to switch tracing on/off
  - select important parts only
  - for iterative programs: trace some iterations only
- Trace only important events
  - no MPI administrative functions and/or MPI collective functions
  - only user functions at call tree level 1 or 2
  - never functions called in (inner) loops
NOTE: Make sure to collect complete message traffic! For non-blocking communication (MPI_I*), include MPI_Wait* + MPI_Test*

Vampirtrace
- Commercial product of Pallas, Germany
- Library for tracing of MPI and application events
  - records point-to-point communication
  - records user subroutines (on request)
  - records collective communication (V2.0)
  - records MPI-2 I/O operations (V2.0)
  - records source code information (V2.0, some platforms)
  - support for shmem (V2.0, Cray T3E)
- Uses the MPI profiling interface (see the sketch below)
- http://www.pallas.de/pages/vampirt.htm
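To make the last point concrete: the MPI standard's profiling interface provides every routine under a second name with the prefix PMPI_, so a tracing library can supply its own version of, say, MPI_Send that records events and then forwards the real work to PMPI_Send. The sketch below illustrates the general mechanism only; it is not Vampirtrace's implementation, and trace_event() is a hypothetical placeholder:

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical event-recording helper; a real tracing library would
       write a timestamped event record here (placeholder, not a VT call). */
    static void trace_event(const char *event, int peer)
    {
        fprintf(stderr, "event: %s (peer %d)\n", event, peer);
    }

    /* Wrapper with the standard MPI_Send name: because every MPI routine
       is also available under the PMPI_ name, the tracing library can
       intercept the call, record events, and forward the real work. */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        trace_event("enter MPI_Send", dest);
        int ret = PMPI_Send(buf, count, datatype, dest, tag, comm);
        trace_event("exit MPI_Send", dest);
        return ret;
    }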

Vampirtrace: Usage
Record MPI-related information
- Re-link the MPI application (no re-compilation necessary!)

    % f90 *.o -o myprog -L$(VTHOME)/lib -lvt -lpmpi -lmpi
    % cc  *.o -o myprog -L$(VTHOME)/lib -lvt -lpmpi -lmpi
    % CC  *.o -o myprog -L$(VTHOME)/lib -lvt -lpmpi -lmpi

- [MPICH] Re-link using the -vt option of the MPI compiler scripts

    % mpif90 -vt *.o -o myprog
    % mpicc  -vt *.o -o myprog
    % mpiCC  -vt *.o -o myprog

- Execute the MPI binary as usual
Record user subroutines
- insert calls to the Vampirtrace API (portable, but manual work)
- automatic instrumentation (NEC, Fujitsu, Hitachi)
- use an instrumentation tool (Cray PAT, dyninst, ...)

Vampirtrace Instrumentation API (C/C++)
Calls for recording user subroutines:

    #include "VT.h"

    /* -- Symbols can be defined with or without source information -- */
    VT_symdefl(123, "foo", "USER", "foo.c:6");
    VT_symdef (123, "foo", "USER");

    void foo() {
        VT_begin(123);   /* 1st executable line     */
        ...
        VT_end(123);     /* at EVERY exit point!    */
    }

- VT calls can only be used between MPI_Init and MPI_Finalize!
- Event numbers used must be globally unique
- Selective tracing:

    VT_traceoff();   /* turn tracing off */
    VT_traceon();    /* turn tracing on  */
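Putting these calls together, a minimal instrumented program might look like the following sketch (the routine name compute and the overall structure are illustrative, not taken from the slides; error handling is omitted):

    #include <mpi.h>
    #include "VT.h"

    /* Illustrative user routine instrumented as described above. */
    void compute(void)
    {
        VT_begin(123);        /* first executable line */
        /* ... real work ... */
        VT_end(123);          /* at EVERY exit point   */
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);            /* VT calls valid only after this  */

        VT_symdef(123, "compute", "USER"); /* globally unique event number    */

        VT_traceoff();                     /* e.g. skip the start-up phase    */
        compute();
        VT_traceon();                      /* trace only the interesting part */
        compute();

        MPI_Finalize();                    /* ... and only before this        */
        return 0;
    }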

VT++.h - C++ class wrapper for Vampirtrace

    #ifndef VT_PLUSPLUS_
    #define VT_PLUSPLUS_
    #include "VT.h"

    class VT_Trace {
    public:
        VT_Trace(int code) { VT_begin(code_ = code); }
        ~VT_Trace()        { VT_end(code_); }
    private:
        int code_;
    };
    #endif /* VT_PLUSPLUS_ */

The same trick can be used to wrap other tracing APIs for C++, too.

Usage:

    VT_symdef(123, "foo", "USER");  // symbol definition as before

    void foo(void)                  // user subroutine to monitor
    {
        VT_Trace vt(123);           // declare VT_Trace object in 1st line
        ...                         // => automatic tracing by ctor/dtor
    }

Vampirtrace Instrumentation API (Fortran)
Calls for recording user subroutines:

    include 'VT.inc'
    integer ierr
    call VTSYMDEF(123, "foo", "USER", ierr)
    ! or
    call VTSYMDEFL(123, "foo", "USER", "foo.f:8", ierr)

    SUBROUTINE foo(...)
    include 'VT.inc'
    integer ierr
    call VTBEGIN(123, ierr)
C   ...
    call VTEND(123, ierr)
    END

- Selective tracing:

    VTTRACEOFF()   ! turn tracing off
    VTTRACEON()    ! turn tracing on

Vampirtrace: Runtime Configuration
- Trace file collection and generation can be controlled using a configuration file
  - trace file name, location, size
  - flush behavior
  - activation/deactivation of trace recording for specific processes, activities (groups of symbols), and symbols
- Activate a configuration file by setting environment variables (example below)
  - VT_CONFIG: name of the configuration file (use an absolute pathname if possible)
  - VT_CONFIG_RANK: MPI rank of the process which should read the configuration file (default: 0)
- Reduce trace file sizes
  - restrict event collection in a configuration file
  - use the selective tracing functions (VT_traceoff() / VT_traceon())
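For instance, with a csh-style shell the configuration file could be activated like this before the run (the file path and process count are placeholders):

    % setenv VT_CONFIG /home/joe/myprog/vt.conf
    % setenv VT_CONFIG_RANK 0
    % mpirun -np 8 myprog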

Vampirtrace: Configuration File Example

    # collect traces only for MPI ranks 1 to 5
    TRACERANKS 1:5:1
    # record at most 20000 records per rank
    MAX-RECORDS 20000
    # do not collect administrative MPI calls
    SYMBOL MPI_comm* off
    SYMBOL MPI_cart* off
    SYMBOL MPI_group* off
    SYMBOL MPI_type* off
    # do not collect USER events
    ACTIVITY USER off
    # except routine foo
    SYMBOL foo on

- Be careful to record complete message transfers!
- For a more complete description, see the Vampirtrace User's Guide

Vampirtrace: Availability
Supported platforms:
- Compaq Alpha, Tru64 Unix
- Cray T3E, Unicos/mk
- Fujitsu VPP300, VPP700, and VPP5000, UXP/V
- Hitachi SR8000, HI-UX/MPP
- HP PA, HP-UX 10 + 11, SPP5
- IBM RS/6000 and SP-2, AIX 4.x
- Linux/Intel 2.x
- NEC SX-4, S-UX 8
- SGI workstations, Origin, and PowerChallenge, IRIX 6
- Sun Solaris/SPARC 2.7
- Sun Solaris/Intel 2.6

Vampir
- Visualization and analysis of MPI programs
- Originally developed by Forschungszentrum Jülich
- Main development now at Technical University Dresden
- Commercially distributed by PALLAS, Germany
- http://www.pallas.de/pages/vampir.htm

Vampir: General Description
- Offline trace analysis for message-passing trace files
- Convenient user interface / easy customization
- Scalability in time and processor space
- Excellent zooming and filtering
- Display and analysis of MPI and application events:
  - user subroutines
  - point-to-point communication
  - collective communication (V2.5)
  - MPI-2 I/O operations (V2.5)
- All diagrams highly customizable (through context menus)
- Large variety of displays for ANY part of the trace

Vampir: Main Window
- Trace file loading can be
  - interrupted at any time
  - resumed
  - started at a specified time offset
- Provides main menu access to
  - global and process-local displays
  - preferences
  - help
- Trace file can be re-written (with re-grouped symbols)

Vampir: Displays
Global displays show all selected processes:
- Summary Chart: aggregated profiling information
- Activity Chart: per-process profiling information
- Timeline: detailed application execution over a time axis
- Communication Statistics: message statistics for each process pair
- Global Comm. Statistics: collective operations statistics
- I/O Statistics: MPI I/O operation statistics
- Calling Tree: global dynamic calling tree
Process displays show a single process per window (for selected processes):
- Activity Chart
- Timeline
- Calling Tree

Vampir: Time Line Diagram
- Functions are organized into groups, with coloring by group
- Message lines can be colored by tag or size
- Information about states, messages, collective and I/O operations is available by clicking on their representation

Vampir: Time Line Diagram (Message Info)
- Source code references are displayed if they were recorded in the trace

Vampir: Support for Collective Communication
- For each process, the operation is marked locally (start of op, stop of op, data being sent, data being received)
- Start/stop points are connected by lines

Vampir: Collective Communication

Vampir: MPI-I/O Support
- MPI I/O operations are shown as message lines to a separate I/O system timeline

Vampir: Execution Statistics
- Aggregated profiling information: execution time, number of calls, inclusive/exclusive
- Available for all / any group (activity) or all routines (symbols)
- Available for any part of the trace (selectable through the timeline diagram)

Vampir: Communication Statistics
- Bytes sent/received for collective operations
- Byte and message count, min/max/avg message length, and min/max/avg bandwidth for each process pair
- Message length statistics
- Available for any part of the trace

Vampir: Other Features
- Dynamic global call graph tree
- Parallelism display
- Powerful filtering and trace comparison features
- All diagrams highly customizable (through context menus)

Vampir: Process Displays
- Activity chart
- Call tree
- Time line
Available for all processes selected in the global displays

Vampir: Availability
Supported platforms:
- Compaq Alpha, Tru64 Unix
- Cray T3E
- Hitachi SR2201, HI-UX/MPP
- HP PA, HP-UX 10
- HP Exemplar, SPP-UX 5.2
- IBM RS/6000 and SP-2, AIX 4
- Linux/Intel 2.0
- NEC EWS, EWS-UX 4
- NEC SX-4, S-UX 8
- SGI workstations, Origin and PowerChallenge, IRIX 6
- Sun Solaris/SPARC 2.7
- Sun Solaris/Intel 2.6