Score-P: A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir. VI-HPS Team

Score-P: Specialized Measurements and Analyses

Mastering build systems

- Hooking the Score-P instrumenter scorep into complex build environments such as Autotools or CMake has always been challenging.
- Score-P provides convenience wrapper scripts to simplify this (since Score-P 2.0).
- Autotools and CMake need the compiler already in the configure step, but instrumentation should happen only in the build step, not during configuration:

% SCOREP_WRAPPER=off \
> cmake .. \
>     -DCMAKE_C_COMPILER=scorep-icc \
>     -DCMAKE_CXX_COMPILER=scorep-icpc

- SCOREP_WRAPPER=off disables instrumentation in the configure step.
- The wrapper scripts are specified as the compilers to use.
- Additional options can be passed to the Score-P instrumenter and the compiler via environment variables, without modifying the Makefiles.
- Run scorep-wrapper --help for a detailed description and the wrapper scripts available in the Score-P installation.
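As an illustration, extra options could then be passed in the actual build step through the wrapper's environment variables. This is a hedged sketch; the exact variable names supported by your installation are listed by scorep-wrapper --help:

% SCOREP_WRAPPER_INSTRUMENTER_FLAGS="--user" \
> SCOREP_WRAPPER_COMPILER_FLAGS="-g" \
> make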

Mastering C++ applications

- Automatic compiler instrumentation greatly disturbs C++ applications because of frequent, short function calls => use sampling instead.
- Novel combination of sampling events and instrumentation of MPI, OpenMP, ...
- Sampling replaces compiler instrumentation (instrument with --nocompiler to further reduce overhead) => filtering is no longer needed.
- Instrumentation is still used to get accurate times for parallel activities, so that patterns of inefficiency can still be identified.
- Supports profile and trace generation.

% export SCOREP_ENABLE_UNWINDING=true
% # use the default sampling frequency
% #export SCOREP_SAMPLING_EVENTS=perf_cycles@2000000
% OMP_NUM_THREADS=4 mpiexec -np 4 ./bt-mz_W.4

- Set the new configuration variable SCOREP_ENABLE_UNWINDING to enable sampling.
- Available since Score-P 2.0; currently only x86-64 is supported.
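A minimal end-to-end sketch, assuming a hybrid MPI+OpenMP C++ code in solver.cpp (file and compiler names are placeholders, not part of the tutorial example):

% scorep --nocompiler mpicxx -O2 -fopenmp solver.cpp -o solver
% export SCOREP_ENABLE_UNWINDING=true
% OMP_NUM_THREADS=4 mpiexec -np 4 ./solver

MPI and OpenMP activities are still instrumented, while the user code is covered by sampling.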

Mastering C++ applications

- Less disturbed measurement (screenshot)

Mastering application memory usage

- Determine the maximum heap usage per process
- Find high-frequency, small-allocation patterns
- Find memory leaks
- Support for: C, C++, MPI, and SHMEM (Fortran only with GNU compilers)
- Profile and trace generation (profile recommended)
  - Memory leaks are recorded only in the profile
  - Resulting traces are not yet supported by Scalasca

% export SCOREP_MEMORY_RECORDING=true
% export SCOREP_MPI_MEMORY_RECORDING=true
% OMP_NUM_THREADS=4 mpiexec -np 4 ./bt-mz_W.4

- Set the new configuration variables to enable memory recording.
- Available since Score-P 2.0.
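The recorded memory metrics end up in the profile of the Score-P experiment directory and can be examined, for example, with the Cube GUI if it is installed (the experiment directory name below stands for the one created by your run):

% cube scorep-*/profile.cubex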

Mastering application memory usage

- Different maximum heap usage per rank (screenshot)

Mastering application memory usage

- Memory leaks (screenshot)

Mastering heterogeneous applications

- Record CUDA applications and device activities:

% export SCOREP_CUDA_ENABLE=gpu,kernel,idle

- Record OpenCL applications and device activities:

% export SCOREP_OPENCL_ENABLE=api,kernel

- Record OpenACC applications:

% export SCOREP_OPENACC_ENABLE=yes

- Can be combined with CUDA recording if the target is an NVIDIA device:

% export SCOREP_CUDA_ENABLE=kernel
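A possible sketch for recording a CUDA application, assuming a CUDA-enabled Score-P installation whose instrumenter can wrap nvcc (source and executable names are placeholders):

% scorep nvcc -O2 app.cu -o app
% export SCOREP_CUDA_ENABLE=gpu,kernel,idle
% ./app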

Mastering heterogeneous applications

- Screenshot showing: host-device memory transfers, device activities, OpenACC directives, and CUDA API calls

Enriching measurements with performance counters

- Record metrics from PAPI:

% export SCOREP_METRIC_PAPI=PAPI_TOT_CYC
% export SCOREP_METRIC_PAPI_PER_PROCESS=PAPI_L3_TCM

- Use the PAPI tools to list available metrics and valid combinations:

% papi_avail
% papi_native_avail

- Record metrics from Linux perf:

% export SCOREP_METRIC_PERF=cpu-cycles
% export SCOREP_METRIC_PERF_PER_PROCESS=LLC-load-misses

- Use the perf tool to list available metrics and valid combinations:

% perf list

- With the *_PER_PROCESS variables, only the master thread records the metric (assuming all threads of the process access the same L3 cache).
- Write your own metric plugin; a repository of available plugins: https://github.com/score-p
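Several metrics can be requested at once as a comma-separated list; a hedged example reusing the BT-MZ run from above:

% export SCOREP_METRIC_PAPI=PAPI_TOT_CYC,PAPI_TOT_INS
% OMP_NUM_THREADS=4 mpiexec -np 4 ./bt-mz_W.4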

Score-P user instrumentation API

- Not a replacement for automatic compiler instrumentation
- Can be used to further subdivide functions
  - E.g., multiple loops inside a function
- Can be used to partition the application into coarse-grain phases
  - E.g., initialization, solver, and finalization
- Enabled with the --user flag of the Score-P instrumenter
- Available for Fortran / C / C++
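The --user flag is simply added to the instrumented compile and link commands; a minimal sketch (compiler and file names are placeholders):

% scorep --user mpicc -c foo.c
% scorep --user mpicc foo.o -o foo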

Score-P user instrumentation API (Fortran)

#include "scorep/scorep_user.inc"

subroutine foo( )
  ! Declarations
  SCOREP_USER_REGION_DEFINE( solve )

  ! Some code
  SCOREP_USER_REGION_BEGIN( solve, "<solver>", SCOREP_USER_REGION_TYPE_LOOP )
  do i=1,100
    [...]
  end do
  SCOREP_USER_REGION_END( solve )

  ! Some more code
end subroutine

- Requires processing by the C preprocessor
- For most compilers, this happens automatically when the file has an uppercase extension, e.g., main.F or main.F90

Score-P user instrumentation API (C/C++)

#include "scorep/scorep_user.h"

void foo()
{
  /* Declarations */
  SCOREP_USER_REGION_DEFINE( solve )

  /* Some code */
  SCOREP_USER_REGION_BEGIN( solve, "<solver>", SCOREP_USER_REGION_TYPE_LOOP )
  for (i = 0; i < 100; i++)
  {
    [...]
  }
  SCOREP_USER_REGION_END( solve )

  /* Some more code */
}

Score-P user instrumentation API (C++)

#include "scorep/scorep_user.h"

void foo()
{
  // Declarations

  // Some code
  {
    SCOREP_USER_REGION( "<solver>", SCOREP_USER_REGION_TYPE_LOOP )
    for (i = 0; i < 100; i++)
    {
      [...]
    }
  }

  // Some more code
}

Score-P measurement control API

- Can be used to temporarily disable measurement for certain intervals
  - Annotation macros are ignored by default
  - Enabled with the --user flag

Fortran (requires C preprocessor):

#include "scorep/scorep_user.inc"

subroutine foo( )
  ! Some code
  SCOREP_RECORDING_OFF()
  ! Loop will not be measured
  do i=1,100
    [...]
  end do
  SCOREP_RECORDING_ON()
  ! Some more code
end subroutine

C / C++:

#include "scorep/scorep_user.h"

void foo( )
{
  /* Some code */
  SCOREP_RECORDING_OFF()
  /* Loop will not be measured */
  for (i = 0; i < 100; i++)
  {
    [...]
  }
  SCOREP_RECORDING_ON()
  /* Some more code */
}

Score-P: Conclusion and Outlook

Project management

- Ensure a single official release version at all times, which will always work with the tools
- Allow experimental versions for new features or research
- Commitment to joint long-term cooperation
- Development based on a meritocratic governance model
- Open for contributions and new partners

Future features

- Scalability to the maximum available CPU core count
- Support for emerging architectures and new programming models
- Features currently worked on:
  - User-provided wrappers for 3rd-party libraries
  - Hardware and MPI topologies
  - Basic support for measurements without re-compiling/re-linking
  - I/O recording
  - Java recording
  - Persistent memory recording (e.g., PMEM, NVRAM, ...)

Further information

- Community instrumentation & measurement infrastructure
  - Instrumentation (various methods) and sampling
  - Basic and advanced profile generation
  - Event trace recording
  - Online access to profiling data
- Available under the New BSD open-source license
- Documentation & sources: http://www.score-p.org
- User guide also part of the installation: <prefix>/share/doc/scorep/{pdf,html}/
- Support and feedback: support@score-p.org
- Subscribe to news@score-p.org to stay up to date