Recent Developments in Score-P and Scalasca V2

Bernd Mohr
9th Scalable Tools Workshop, Lake Tahoe, August 2015
Member of the Helmholtz Association

YOU KNOW YOU MADE IT IF LARGE COMPANIES STEAL YOUR STUFF August 2015 JSC 2

Source: https://software.intel.com/en-us/videos/quickly-discover-performance-issues-with-the-intel-trace-analyzer-and-collector-90-beta

Scalasca
Scalable Analysis of Large Scale Applications
Approach
- Instrument C, C++, and Fortran parallel applications
  - Based on MPI, OpenMP, SHMEM, or hybrid
- Option 1: scalable call-path profiling
- Option 2: scalable event trace analysis
  - Collect event traces
  - Search trace for event patterns representing inefficiencies
  - Categorize and rank inefficiencies found
- Supports MPI 2.2 (P2P, collectives, RMA, IO) and OpenMP 3.0 (exception: nesting)
http://www.scalasca.org/

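[Figure: Scalasca workflow. Source modules pass through the instrumenter (compiler / linker) to give an instrumented executable. The instrumented target application runs with the Score-P measurement library (with PAPI) and produces a summary report or local event traces. The parallel Scalasca wait-state analyzer performs the trace search and produces a wait-state report. Report manipulation prepares the reports for CUBE, which answers: Which problem? Where in the program? Which process? The results feed an optimized measurement configuration.]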

Scalasca Command

Prepare application objects and executable for measurement
  Scalasca 1:
    1) scalasca instrument <compile-or-link-command>
    2) skin <compile-or-link-command>
  Scalasca 2:
    1) scalasca instrument <compile-or-link-command> *
    2) skin <compile-or-link-command> *
    3) scorep <compile-or-link-command> **
Run application under control of measurement system
    1) scalasca analyze <application-launch-command>
    2) scan <application-launch-command>
    3) set environment variables and run as usual
Interactively explore measurement analysis report
    1) scalasca examine <experiment-archive report>
    2) square <experiment-archive report>

*  command is deprecated and only provided for backwards compatibility with Scalasca 1.x.
** recommended option
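The three workflow steps listed above can be scripted. The following is a minimal sketch that only builds the command lines, using the scorep, scan, and square commands named on the slide. The compiler call, the launcher call, and the experiment directory name are hypothetical placeholders and are not taken from the slides.

```python
import shlex


def instrument_command(compile_or_link_command):
    """Step 1: prefix a compile or link command with the Score-P instrumenter."""
    return ["scorep"] + list(compile_or_link_command)


def analyze_command(application_launch_command):
    """Step 2: run the application under control of the measurement system."""
    return ["scan"] + list(application_launch_command)


def examine_command(experiment_archive):
    """Step 3: open the analysis report for interactive exploration."""
    return ["square", experiment_archive]


# Hypothetical inputs, chosen only for illustration.
steps = [
    instrument_command(["mpicc", "-o", "app", "app.c"]),
    analyze_command(["mpiexec", "-n", "4", "./app"]),
    examine_command("scorep_example_experiment"),
]

for argv in steps:
    # shlex.join produces a shell-safe string for display or logging.
    print(shlex.join(argv))
```

Returning argument lists rather than strings keeps the sketch safe to pass to `subprocess.run` without shell quoting issues, should the commands actually be executed.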

Scalasca 1 vs Scalasca 2

                              Scalasca 1                 Scalasca 2
Instrumentation               EPIK                       Score-P
Command line switches         different
Manual instrumentation API    different
Environmental variables       different
Memory buffers                separate for each thread   memory pool on each process
Trace format                  EPILOG                     OTF2
Structure of the filter file  different
Scalable I/O                  supports SIONlib           partially supports SIONlib
Report format                 CUBE3                      CUBE4
Experiment directory          epik_                      scorep_
License                       3-clause BSD (both)

For more information
Zhukov, I.; Feld, C.; Geimer, M.; Knobloch, M.; Mohr, B.; Saviankou, P.: Scalasca v2: Back to the Future. In: Niethammer, Christoph (Editor), Tools for High Performance Computing 2014, Stuttgart, Germany, 2015. ISBN: 978-3-319-16011-5 [doi:10.1007/978-3-319-16012-2_1]

Integration
- Need integrated tool (environment) for all levels of parallelization
  - Inter-node (MPI, PGAS, SHMEM)
  - Intra-node (OpenMP, multi-threading, multi-tasking)
  - Accelerators (CUDA, OpenCL)
- Integration with performance modeling and prediction
- No tool fits all requirements
  - Interoperability of tools
  - Integration via open interfaces

Score-P Functionality
- Provide typical functionality for HPC performance tools
- Instrumentation (various methods)
  - Multi-process paradigms (MPI, SHMEM)
  - Thread-parallel paradigms (OpenMP, POSIX threads)
  - Accelerator-based paradigms (CUDA, OpenCL)
  - And their combination
- Flexible measurement without re-compilation:
  - Basic and advanced profile generation
  - Event trace recording
  - Online access to profiling data
- Highly scalable I/O functionality
- Support all fundamental concepts of partners' tools

Non-functional Requirements
- Portability: support all major HPC platforms
  - IBM Blue Gene, Cray X*, Fujitsu K/FX10
  - x86, x86_64, PPC, Sparc, ARM clusters (Linux, AIX, Solaris)
- Scalability
  - Petascale, supporting platforms with more than 100K cores
- Low measurement overhead
  - Typically less than 5%
- Robustness and QA
  - Nightly builds, continuous integration testing framework
- Easy and uniform installation through EasyBuild
- Open Source: New BSD License

Tool Dependencies
Note!
- Only 1 tool chain (compiler/MPI combination)
- Only 1 version

Score-P Partners
- Forschungszentrum Jülich, Germany
- German Research School for Simulation Sciences, Aachen, Germany
- Gesellschaft für numerische Simulation mbH, Braunschweig, Germany
- RWTH Aachen, Germany
- Technische Universität Dresden, Germany
- Technische Universität München, Germany
- University of Oregon, Eugene, USA

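The Score-P Tool Ecosystem
[Figure: The instrumented target application is measured by Score-P (with PAPI). Score-P offers an online interface (Periscope), CUBE4 reports (TAU ParaProf, TAU PerfExplorer, CUBE), and OTF2 traces (Vampir, Scalasca wait-state analysis). The Scalasca wait-state analysis in turn produces a CUBE4 report. The figure also carries the label "Remote Guidance".]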

Past Funded Integration Projects
- SILC (01/2009 to 12/2011)
  - Unified measurement system (Score-P) for Vampir, Scalasca, Periscope
- PRIMA (08/2009 to 10/2013)
  - Integration of TAU and Scalasca
- LMAC (08/2011 to 07/2013)
  - Evolution of Score-P
  - Analysis of performance dynamics
- H4H (10/2010 to 09/2013)
  - Hybrid programming for heterogeneous platforms
- HOPSA (02/2011 to 01/2013)
  - Integration of system and application monitoring

Current Funded Integration Projects
- Score-E (10/2013 to 09/2016)
  - Analysis and optimization of energy consumption
- PRIMA-X (11/2014 to 10/2017)
  - Extreme scale monitoring and analysis
- RAPID (04/2014 to 03/2015)
  - Enhanced support for node-level programming models
    - POSIX, ACE, Qt threads, MTAPI
  - Microsoft Windows support
- Mont-Blanc-2 (10/2013 to 09/2016)
  - OpenCL support
  - OmpSs support

CUBE V4 PLUGIN INTERFACE

GUI Plugin: CallGraph

Cube Viz Plugins: Phase Heatmap
- Phase profiling
  - Collects data for each instance of phases marked in program instead of aggregating it
- Shows data over time (phase instances) for each rank/thread
Mont-Blanc-2 F2F, Stuttgart, Apr 16th/17th, 2015

Cube Viz Plugins: Phase Barplot
- Phase profiling
  - Collects data for each instance of phases marked in program instead of aggregating it
- Shows min/max/avg metric value over time (phase instances)
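The difference between an aggregated profile and the two phase views can be sketched with a small data set. This is only an illustration of the idea, not the plugins' implementation. The timing values and function names are hypothetical.

```python
from statistics import mean

# Hypothetical per-instance timings in seconds: one row per rank,
# one column per instance of the marked phase (for example, a time step).
phase_times = [
    [1.0, 1.2, 1.1, 1.5],  # rank 0
    [1.0, 1.1, 1.3, 2.0],  # rank 1
    [0.9, 1.2, 1.2, 2.5],  # rank 2
]


def aggregate_profile(times):
    """Conventional profile: instances are summed into one value per rank."""
    return [sum(row) for row in times]


def heatmap(times):
    """Phase heatmap data: one value per (rank, phase instance) cell."""
    return [list(row) for row in times]


def barplot(times):
    """Phase barplot data: (min, max, avg) across ranks per phase instance."""
    columns = list(zip(*times))
    return [(min(col), max(col), mean(col)) for col in columns]


print(aggregate_profile(phase_times))
print(barplot(phase_times))
```

In this made-up data the aggregated totals hide that the imbalance between ranks only appears in the last phase instance, which the per-instance views expose.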

Integration of Measurement and Modelling
Example: DFG SPPEXA Catwalk Project

Input
- Program: main() { foo() bar() compute() }
- Instrumentation: all functions
- Performance measurements (profiles) at six process counts:
  p1 = 128, p2 = 256, p3 = 512, p4 = 1,024, p5 = 2,048, p6 = 4,096

Output (automated modeling)
  Rank  Function   Model [s]
  1     bar()      4.0 * p + 0.1 * log(p)
  2     compute()  0.5 * log(p)
  3     foo()      65.7
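The idea of turning profiles at several process counts into a scaling model can be sketched as follows. This is not the Catwalk method. It is a minimal illustration that fits a single-term model c0 + c1 * f(p) by least squares and keeps the candidate term with the smallest residual. The candidate terms, the logarithm base, and the synthetic timings (generated from the models on the slide) are assumptions.

```python
import math

# Process counts p1..p6 from the slide.
PROCESS_COUNTS = [128, 256, 512, 1024, 2048, 4096]

# Hypothetical candidate terms f(p) for a model c0 + c1 * f(p).
# The logarithm base is an assumption; the slide only writes log(p).
CANDIDATE_TERMS = {
    "log(p)": lambda p: math.log2(p),
    "p": lambda p: float(p),
    "p*log(p)": lambda p: p * math.log2(p),
    "p^2": lambda p: float(p) ** 2,
}


def fit_line(xs, ys):
    """Least-squares fit of y = c0 + c1 * x; returns (c0, c1, rss)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    rss = sum((y - (c0 + c1 * x)) ** 2 for x, y in zip(xs, ys))
    return c0, c1, rss


def fit_model(counts, times):
    """Return (term, c0, c1) for the best-fitting single-term model."""
    mean_t = sum(times) / len(times)
    rss_const = sum((t - mean_t) ** 2 for t in times)
    # Treat (numerically) flat measurements as a constant model.
    if rss_const <= 1e-12 * max(1.0, mean_t ** 2):
        return "constant", mean_t, 0.0
    best = None
    for name, term in CANDIDATE_TERMS.items():
        c0, c1, rss = fit_line([term(p) for p in counts], times)
        if best is None or rss < best[3]:
            best = (name, c0, c1, rss)
    return best[:3]


# Synthetic "measurements" generated from the slide's models (not real data).
synthetic = {
    "bar()": [4.0 * p + 0.1 * math.log2(p) for p in PROCESS_COUNTS],
    "compute()": [0.5 * math.log2(p) for p in PROCESS_COUNTS],
    "foo()": [65.7 for _ in PROCESS_COUNTS],
}

models = {name: fit_model(PROCESS_COUNTS, ts) for name, ts in synthetic.items()}
for name, (term, c0, c1) in models.items():
    print(name, term, c0, c1)
```

A single-term fit cannot reproduce the two-term model of bar() exactly, but it still identifies the dominant linear term, which is what determines the ranking by growth.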

Catwalk: Result Visualization

CUBE Derived Metrics
- Cube v4 now also supports definition of derived metrics
  - Based on CubePL DSL
  - PreDerived and PostDerived metrics
- List of selected features:
  - Support for various arithmetic calls
  - Support of arrays and variables
  - Automatic data type conversion
  - Lambda-function definitions
  - Predefined variables
  - Redefinition of aggregation operation

Saviankou, P.; Knobloch, M.; Visser, A.; Mohr, B.: Cube v4: From Performance Report Explorer to Performance Analysis Tool. International Conference On Computational Science (ICCS 2015), Procedia Computer Science 51, 1343-1352 (2015) [doi:10.1016/j.procs.2015.05.320]
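The distinction between PreDerived and PostDerived metrics can be illustrated outside of CubePL. The sketch below assumes the usual reading of the two kinds: a pre-derived metric evaluates its expression per data item and then aggregates the results, while a post-derived metric aggregates the base metrics first and then evaluates the expression. That reading, the call-path names, and the numbers are assumptions and are not taken from the slide. No CubePL syntax is shown.

```python
# Hypothetical per-call-path data: (time in seconds, number of visits).
measurements = {
    "main/foo": (6.0, 3),
    "main/bar": (2.0, 4),
}


def time_per_visit(time, visits):
    """The derived expression: average time per visit."""
    return time / visits


def post_derived(data):
    """Aggregate the base metrics first, then evaluate the expression."""
    total_time = sum(t for t, _ in data.values())
    total_visits = sum(v for _, v in data.values())
    return time_per_visit(total_time, total_visits)


def pre_derived(data):
    """Evaluate the expression per call path, then aggregate the results."""
    return sum(time_per_visit(t, v) for t, v in data.values())


print(post_derived(measurements))
print(pre_derived(measurements))
```

For a ratio such as time per visit, only the post-derived form yields a meaningful aggregate, which is why the order of evaluation and aggregation has to be part of the metric definition.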

SUCCESS STORIES

Performance Tool Scaling: Scalasca
- Latest test case
  - Granular Dynamics Simulation
  - Based on Physics Engine (PE) Framework (Erlangen)
  - PRACE @ ISC Award winner
  - MPI only
- Scalasca 1.x experiments on JUQUEEN
  - Full machine experiment: 28,672 nodes x 32 MPI ranks = 917,504 processes [Limit: Memory / System metadata]
  - Largest no. of threads: 20,480 nodes x 64 MPI ranks = 1,310,720 processes [Limit: Memory / System metadata]
- Scalasca 2.x / Score-P 1.4.1: NAS BT-MZ on JUQUEEN
  - Profiles: 16,384 x 64 = 1,048,576 threads [Limit: BT-MZ]
  - Traces: 10,240 x 64 = 655,360 threads [Limit: OTF2]
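The scale figures quoted above are simple products and can be checked directly. The sketch below recomputes each total from the two factors given on the slide. The dictionary labels are shorthand introduced here for illustration.

```python
# (first factor, second factor, total reported on the slide)
configurations = {
    "Scalasca 1.x, full machine": (28_672, 32, 917_504),
    "Scalasca 1.x, largest run": (20_480, 64, 1_310_720),
    "Scalasca 2.x, profiles": (16_384, 64, 1_048_576),
    "Scalasca 2.x, traces": (10_240, 64, 655_360),
}

# Recompute each total and compare it with the reported value.
totals = {label: a * b for label, (a, b, _) in configurations.items()}
consistent = all(
    totals[label] == reported
    for label, (_, _, reported) in configurations.items()
)

for label, total in totals.items():
    print(f"{label}: {total:,}")
print("all totals consistent:", consistent)
```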

Scalasca: 1,310,720 process test case

Showcase: TerrSysMP
- Scale-consistent, highly modular, integrated multi-physics sub-surface/surface hydrology-vegetation-atmosphere modelling system
- Fully-coupled MPMD simulation consisting of
  - COSMO (Weather prediction)
  - CLM (Community Land Model)
  - ParFlow (Parallel Watershed Flow)
  - OASIS coupler

Success Story: TerrSysMP
- Identified several sub-component bottlenecks:
  - Inefficient communication patterns
  - Unnecessary/inefficient code blocks
  - Inefficient data structures
- Performance of subcomponents improved by factor of 2!
- Scaling improved from 512 to 32,768 cores!

The Team
Markus Geimer, Michael Knobloch, Bernd Mohr, Christian Rössel, Marc Schlütter, Pavel Saviankou, Alexandre Strube, Brian Wylie, Anke Visser, Ilja Zhukov
Sponsors
[Sponsor logos]

Questions?
Check out http://www.scalasca.org
Or contact us at scalasca@fz-juelich.de