Performance Analysis and Optimization of Scientific Applications on Extreme-Scale Computer Systems

Bernd Mohr, Jülich Supercomputing Centre (JSC)
1st International Workshop on Strategic Development of High Performance Computers, Tsukuba, March 18-19, 2013

Parallel Architectures: State of the Art

[Diagram: today's systems combine a network or switch connecting SMP and NUMA nodes (N_0 ... N_k), multi-level memory interconnects, and multicore processors with a cache hierarchy (L1/L2/L3) shared among cores.]

Exascale Performance Challenges

Exascale systems will consist of:
- Complex configurations with a huge number of components
- Very likely heterogeneous hardware
- Not enough memory
- Dynamically changing configurations due to fault recovery and power saving

Deep hierarchies of large, complex software components will be required to make use of such systems. Sophisticated, integrated performance measurement, analysis, and optimization capabilities will be required to operate an Exascale system efficiently.

Cross-Cutting Considerations

- Performance-aware design, development, and deployment of hardware and software is necessary
- Integration with the OS, compilers, middleware, and runtime systems is required
- Support for performance observability in hardware and software (runtime) is needed
- Performance measurement and optimization must remain possible when hardware or software changes due to faults or power adaptation

Technical Challenges

- Heterogeneity
- Extreme concurrency
- Perturbation and data volume
- Drawing insight from measurements
- Quality information sources

This requires tools to be portable, insightful, scalable, and integrated.

Not many Tools match these Requirements

- TAU, University of Oregon, USA: http://tau.uoregon.edu
- HPCToolkit, Rice University, USA: http://hpctoolkit.org
- Extrae/Paraver, Barcelona Supercomputing Centre, Spain: http://www.bsc.es/paraver
- Vampir toolset, Technical University of Dresden, Germany: http://www.vampir.eu
- Scalasca, Jülich Supercomputing Centre, Germany: http://www.scalasca.org

PORTABILITY: Run everywhere

Scalasca: Supported Platforms

Instrumentation and measurement only (visual analysis on front-end or workstation):
- Cray XT3/XT4/XT5, XE6, XK6
- IBM Blue Gene/L, Blue Gene/P, Blue Gene/Q
- NEC SX-8 and SX-9
- K computer, Fujitsu FX10
- Intel MIC

Full support (instrumentation, measurement, and automatic analysis):
- Linux IA-32, IA-64, x86_64, and PPC based clusters
- IBM AIX POWER3/4/5/6/7 based clusters
- SGI Linux IA-64 and x86_64 based clusters
- Sun/Oracle Solaris SPARC and x86/x86_64 based clusters

Now also on the K Computer

Thanks to Tomotake Nakamura and colleagues at RIKEN!

Known Installations of Scalasca

Companies: Bull (France), Dassault Aviation (France), EDF (France), GNS (Germany), MAGMA (Germany), RECOM (Germany), Shell (Netherlands), Sun Microsystems (USA), Qontix (UK)

Research / HPC centres: ANL (USA), BSC (Spain), CEA (France), CERFACS (France), CINECA (Italy), CSC (Finland), CSCS (Switzerland), DLR (Germany), DKRZ (Germany), EPCC (UK), HLRN (Germany), HLRS (Germany), ICHEC (Ireland), IDRIS (France), JSCC (Russia), LLNL (USA), LRZ (Germany), MSU (Russia), NCAR (USA), NCSA (USA), NSCC (China), ORNL (USA), PSC (USA), RZG (Germany), RIKEN (Japan), SARA (Netherlands), SAITC (Bulgaria), TACC (USA)

Universities: RPI (USA), RWTH (Germany), TUD (Germany), UOregon (USA), UTK (USA)

DoD computing centers (USA): AFRL DSRC, ARL DSRC, ARSC DSRC, ERDC DSRC, Navy DSRC, MHPCC DSRC, SSC-Pacific

INSIGHTFULNESS: More than numbers and diagrams

A picture is worth 1000 words

[Screenshots: timeline visualizations of a simple MPI ring program and of a real-world example.]
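For orientation, a minimal MPI ring program of the kind typically traced for such a picture might look as follows (a generic sketch, not the exact code behind the slide):

    /* Minimal MPI ring program: rank 0 injects a token and every rank
     * passes it to its right neighbor until it returns to rank 0.
     * Run with at least 2 ranks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, token;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int right = (rank + 1) % size;
        int left  = (rank + size - 1) % size;

        if (rank == 0) {
            token = 42;
            MPI_Send(&token, 1, MPI_INT, right, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(&token, 1, MPI_INT, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&token, 1, MPI_INT, right, 0, MPI_COMM_WORLD);
        }
        printf("rank %d passed the token\n", rank);
        MPI_Finalize();
        return 0;
    }

In a timeline view, each rank shows a receive that blocks until its left neighbor forwards the token, which makes the communication dependency chain easy to see.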

What about 1000s of pictures? (with 100s of menu options)

Example Automatic Analysis: Late Sender

Scalasca: Example MPI Patterns

[Timeline diagrams, time versus process, built from ENTER, EXIT, SEND, RECV, and COLLEXIT events: (a) Late Sender, (b) Late Receiver, (c) Late Sender / Wrong Order, (d) Wait at N x N.]
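As an illustration of pattern (a), the following toy fragment provokes a Late Sender: the receiver enters MPI_Recv well before the matching send is issued (a hedged sketch of the situation, not code taken from Scalasca):

    /* Provoking a Late Sender: rank 1 enters the receive early and blocks
     * until rank 0, held up by extra work, finally sends. Run with 2 ranks. */
    #include <mpi.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, msg = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            sleep(2);   /* simulated load imbalance: the send is issued late */
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* rank 1 waits here for about 2 s: the Late Sender wait state */
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }

Scalasca's trace analysis detects such situations automatically and attributes the waiting time to the affected call paths and processes.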

Scalasca Example: CESM Sea Ice Module, Late Sender Analysis

Scalasca Example: CESM Sea Ice Module, Late Sender Analysis + Application Topology

Scalasca Root Cause Analysis

- Wait states are typically caused by load or communication imbalances earlier in the program
- Waiting time can also propagate (e.g., indirect waiting time)
- Goal: enhance performance analysis to find the root cause of wait states

Approach:
- Distinguish between direct and indirect waiting time
- Identify call-path/process combinations that delay other processes and cause first-order waiting time
- Identify the original delay

[Diagram: a delay in foo on process A makes process B wait directly in its Recv; B's consequently late Send makes process C wait indirectly.]
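The distinction becomes concrete in a three-rank toy chain (a sketch assuming blocking point-to-point messages, not Scalasca code): rank 0 carries the original delay, rank 1 waits directly on it, and rank 2 waits only because rank 1 was itself delayed.

    /* Delay propagation: rank 0's excess computation (the delay) causes a
     * direct wait on rank 1 and an indirect wait on rank 2. Run with 3 ranks. */
    #include <mpi.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, v = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            sleep(2);                                    /* the original delay */
            MPI_Send(&v, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* direct wait: caused immediately by rank 0's delay */
            MPI_Recv(&v, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&v, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);
        } else if (rank == 2) {
            /* indirect wait: rank 1 sent late only because rank 0 delayed it */
            MPI_Recv(&v, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }

The delay-cost analysis charges both waits back to the excess computation on rank 0, identifying it as the root cause.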

Scalasca Example: CESM Sea Ice Module, Direct Wait Time Analysis

Scalasca Example: CESM Sea Ice Module, Indirect Wait Time Analysis

Scalasca Example: CESM Sea Ice Module, Delay Costs Analysis

EXTREME CONCURRENCY: To infinity and beyond

Scaling already important TODAY!

Core counts for the TOP500 list of November 2012:

  Cores per system   Systems   Share        Rmax   Rmax share   Total cores
  1025-2048                1    0.2%      122 TF         0.1%         1,280
  2049-4096                2    0.4%      155 TF         0.1%         7,104
  4097-8192               81   16.2%    8,579 TF         5.3%       551,624
  8193-16384             206   41.2%   24,543 TF        15.1%     2,617,986
  > 16384                210   42.0%  128,574 TF        79.4%    11,707,806
  Total                  500    100%  161,973 TF         100%    14,885,800

Average system size: 29,772 cores; median system size: 15,360 cores.

Personal Motivation (I)

07/2012: Jugene, a 72-rack IBM Blue Gene/P with 294,912 cores; the most parallel system in the world from 06/2009 to 06/2011!

Personal Motivation (II)

Juqueen: 28-rack IBM Blue Gene/Q with 458,752 cores and 1,835,008 hardware threads.

Roads to Scalability

Scalable data collection and reduction:
- Automatic detection of the most important execution phases (Paraver)
- Parallel collection and reduction based on MPI and parallel I/O (all tools)

Scalable parallel data analysis:
- Parallel client/server processing and visualization (Vampir)
- Parallel pattern search, delay, and critical-path analysis (Scalasca; see the sketch below)
- Parallel analyzer and visualizer (Paraver)

Scalable visualizations:
- 3D charts and topology displays (TAU, Scalasca)
- Hierarchical browsers (Scalasca)
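To give a flavor of the parallel pattern search, here is a much-simplified sketch of the replay idea behind Scalasca's trace analysis (replay_send and replay_recv are hypothetical helper names, not Scalasca's API): the analyzer re-enacts the recorded communication in parallel, piggybacking event timestamps so that each receiver can quantify its own waiting time locally.

    /* Sketch of timestamp exchange during parallel trace replay;
     * helper names are illustrative, not Scalasca's real interface. */
    #include <mpi.h>

    /* re-enact a recorded send, forwarding the original send timestamp */
    void replay_send(int dst, double send_enter_time)
    {
        MPI_Send(&send_enter_time, 1, MPI_DOUBLE, dst, 0, MPI_COMM_WORLD);
    }

    /* re-enact the matching receive and return the Late Sender waiting time */
    double replay_recv(int src, double recv_enter_time)
    {
        double send_enter_time;
        MPI_Recv(&send_enter_time, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        /* waiting time accrues only if the receive was entered first */
        return (send_enter_time > recv_enter_time)
                   ? send_enter_time - recv_enter_time : 0.0;
    }

Because every process analyzes its own trace and exchanges only small timestamp messages, the search scales with the machine itself.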

TAU ParaProf: 3D Profile, Miranda, 16K PEs

TAU 3D Topology view / distribution histogram

VampirServer BETA: Trace Visualization of S3D @ 200,448

- OTF trace of 4.5 TB
- VampirServer running with 20,000 analysis processes

Paraver Data Reduction Features

- Accumulation of values using software counters
- Powerful filtering expressions over time, processors, states, communications, and events
- Automatic structure/phase detection based on signal processing, using wavelets (Casas: ParCo 2007) and autocorrelation functions (Casas: Euro-Par 2007)
- Also used for cleanup: preemptions, clogged systems / instrumentation overhead, flushing

Scalasca trace analysis: Sweep3D @ 294,912 cores (Blue Gene/P)

- 10 min Sweep3D runtime
- 11 s replay
- 4 min trace data write/read (576 files)
- 7.6 TB buffered trace data
- 510 billion events

B. J. N. Wylie, M. Geimer, B. Mohr, D. Böhme, Z. Szebenyi, F. Wolf: Large-scale performance analysis of Sweep3D with the Scalasca toolset. Parallel Processing Letters, 20(4):397-414, 2010.

Scalasca trace analysis: bt-mz @ 524,288 (Blue Gene/Q)

Scalasca trace analysis: bt-mz @ 1,048,704 (Blue Gene/Q)

INTEGRATION: Together we are strong

Integration

Need an integrated tool environment for all levels of parallelization:
- Inter-node (MPI)
- Intra-node (OpenMP, task-based programming)
- Accelerators (CUDA, OpenCL)

Also needed: integration with performance modeling and prediction.

No tool fits all requirements, so tools must interoperate, integrated via open interfaces.

Scalasca / TAU / Vampir / Paraver / Extrae: Status at the End of 2011

[Diagram: each tool with its own formats and converters: PRV traces for Paraver; VampirTrace, TAU, and EPILOG traces; OTF/VTF3 traces for Vampir; the Scalasca trace analyzer with CUBE3 profiles and presenter; TAU and gprof/mpiP profiles feeding PerfDMF/ParaProf.]

Scalasca / TAU / Vampir / Paraver / Extrae: Status at the Beginning of 2013

[Diagram: Score-P as the common measurement system, writing OTF2 traces consumed by Vampir and the Scalasca trace analyzer, and CUBE4 profiles consumed by the CUBE4 presenter and TAU (PerfDMF/ParaProf); Paraver still uses PRV traces.]

Tool Integration: Score-P Objectives

- Mainly funded by the SILC and LMAC (BMBF) and PRIMA (DOE) projects
- Make the common part of Periscope, Scalasca, TAU, and Vampir a community effort: the Score-P measurement system

Functional requirements:
- Performance data: profiles (CUBE4), traces (OTF2)
- Initially direct instrumentation, later also sampling
- Offline and online access
- Metrics: time, communication metrics, and hardware counters
- Initially MPI 2 and OpenMP 3, later also CUDA and OpenCL

Current release: V1.1.1 of Feb 2013, http://www.score-p.org
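For orientation, manual region instrumentation with the Score-P user API looks roughly like this (macro names as documented in the Score-P user manual; the region name "compute_step" is an arbitrary example, and the code must be built with the scorep compiler wrapper, e.g. "scorep --user mpicc ..."):

    /* Marking a user region for Score-P measurement. */
    #include <scorep/SCOREP_User.h>

    void compute_step(void)
    {
        SCOREP_USER_REGION_DEFINE( region_handle )
        SCOREP_USER_REGION_BEGIN( region_handle, "compute_step",
                                  SCOREP_USER_REGION_TYPE_COMMON )
        /* ... work measured here appears in profiles and traces ... */
        SCOREP_USER_REGION_END( region_handle )
    }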

Score-P Architecture

[Diagram: Vampir, Scalasca, TAU, and Periscope consume event traces (OTF2), call-path profiles (CUBE4), and an online interface produced by the Score-P measurement infrastructure, which integrates hardware counters (PAPI), memory management, and instrumentation via compiler hooks, the TAU instrumentor, COBI (binary), the MAQAO instrumentor, OPARI2 (OpenMP), and MPI wrappers, applied to MPI, OpenMP, and hybrid applications.]

Score-P Partners

- Forschungszentrum Jülich, Germany
- German Research School for Simulation Sciences, Aachen, Germany
- Gesellschaft für numerische Simulation mbH, Braunschweig, Germany
- RWTH Aachen, Germany
- Technische Universität Dresden, Germany
- Technische Universität München, Germany
- University of Oregon, Eugene, USA

Funded Integration Projects

- SILC (01/2009 to 12/2011): unified measurement system (Score-P) for Vampir, Scalasca, Periscope
- PRIMA (08/2009 to 0?/2013): integration of TAU and Scalasca
- LMAC (08/2011 to 07/2013): evolution of Score-P; analysis of performance dynamics
- H4H (10/2010 to 09/2013): hybrid programming for heterogeneous platforms
- HOPSA (02/2011 to 01/2013): integration of system and application monitoring

Integration of Score-P based Tools: HOPSA Workflow

[Diagram: an application linked to Score-P produces a CUBE-4 profile (visual exploration in Cube, Scalasca wait-state analysis, worst-instance visualization) and an OTF-2 trace (visual exploration in Vampir; conversion to PRV for Paraver and Dimemas what-if scenarios); an application measured with ThreadSpotter contributes a memory profile for exploring memory behavior, one measured with Extrae feeds Paraver, and LAPTA contributes system metrics; links are marked done or to-do.]

Scalasca Vampir/Paraver integration [three screenshot slides]

OPEN ISSUES: Future work

Biggest Open Issues

How to handle asynchronous, non-deterministic executions?
- Currently the favored programming model at node level
- Breaks the measure-analyze-optimize cycle

Potential solution:
- Use traditional tools only at the inter-node level
- Use auto-tuning, smart runtime systems inside the node

Further factors contributing to non-determinism:
- Failing components and recovery actions
- Components operating at varying speeds to save energy

Acknowledgements

Scalasca team (JSC and GRS): Markus Geimer, Jie Jiang, Michael Knobloch, Daniel Lorenz, Bernd Mohr, Peter Philippen, Christian Rössel, David Böhme, Marc-André Hermanns, Alexandru Calotoiu, Marc Schlütter, Pavel Saviankou, Alexandre Strube, Brian Wylie, Anke Visser, Ilja Zhukov, Monika Lücke, Aamer Shah, Felix Wolf

Sponsors [logos]