Scalasca performance properties: The metrics tour
Markus Geimer, m.geimer@fz-juelich.de

Scalasca analysis result

Generic metrics

Generic metrics
Time: total CPU allocation time
Execution: execution time without overhead
Overhead: time spent in tasks related to measurement (does not include per-function perturbation!)
Visits: number of times a function/region was executed
Hardware counters: aggregated counter values for each function/region
Computational imbalance: simple load imbalance heuristic

Computational imbalance
Simple load imbalance heuristic:
- Focuses only on computational parts
- Easy to calculate: absolute difference to the average exclusive execution time
- Captures global imbalances: based on the entire measurement, it does not compare individual instances of function calls
- High value = imbalance in the sub-calltree underneath; expand the subtree to find the real location of the imbalance
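
To make the heuristic concrete, here is a minimal hypothetical sketch (not part of the original slides) in which the exclusive time spent in a purely computational work() kernel grows with the rank, so its deviation from the per-rank average shows up as Computational Imbalance:

    #include <mpi.h>

    /* Hypothetical busy-work kernel standing in for real computation. */
    static void work(long units)
    {
        volatile double x = 0.0;
        for (long i = 0; i < units * 1000000L; ++i)
            x += 1.0 / (double)(i + 1);
    }

    int main(int argc, char **argv)
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Exclusive time spent in work() grows with the rank number, so its
           deviation from the average across ranks is reported as
           Computational Imbalance for this call path. */
        work(10 + 10L * rank);

        MPI_Finalize();
        return 0;
    }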

MPI-related metrics

MPI Time hierarchy
MPI
  Synchronization
    Collective
    RMA
      Active Target
      Passive Target
  Communication
    Point-to-point
    Collective
    RMA
  File I/O
    Collective
  Init/Exit

MPI Time hierarchy details
MPI: time spent in pre-instrumented MPI functions
Synchronization: time spent in calls to MPI_Barrier or Remote Memory Access synchronization calls
Communication: time spent in MPI communication calls, subdivided into collective, point-to-point and RMA
File I/O: time spent in MPI file I/O functions, with specialization for collective I/O calls
Init/Exit: time spent in MPI_Init and MPI_Finalize

MPI Synchronizations hierarchy
Synchronizations
  Point-to-point
    Sends
    Receives
  Collective
  RMA
    Fences
    GATS Epochs
      Access Epochs
      Exposure Epochs
    Locks
Provides the number of calls to an MPI synchronization function of the corresponding class.
Note: synchronizations include zero-sized message transfers!

MPI Communications hierarchy
Communications
  Point-to-point
    Sends
    Receives
  Collective
    Exchange
    As source
    As destination
  RMA
    Puts
    Gets
Provides the number of calls to an MPI communication function of the corresponding class.
Note: zero-sized message transfers are considered synchronization!

MPI Transfer hierarchy
Bytes transferred
  Point-to-point
    Sent
    Received
  Collective
    Outgoing
    Incoming
  RMA
    Sent
    Received
Provides the number of bytes transferred by an MPI communication function of the corresponding class.

MPI File operations hierarchy
MPI file operations
  Individual
    Reads
    Writes
  Collective
    Reads
    Writes
Provides the number of calls to MPI file I/O functions of the corresponding class.

MPI collective synchronization time
MPI
  Synchronization
    Collective
      Wait at Barrier
      Barrier Completion
    RMA
  Communication
  File I/O
  Init/Exit

Wait at Barrier
Time spent waiting in front of a barrier call until the last process reaches the barrier operation.
Applies to: MPI_Barrier
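
A minimal illustrative sketch of code that would produce this pattern (not from the slides; usleep merely stands in for computation):

    #include <mpi.h>
    #include <unistd.h>   /* usleep(), used as a stand-in for computation */

    int main(int argc, char **argv)
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Rank 0 reaches the barrier last; all other ranks accumulate
           Wait at Barrier time while blocked inside MPI_Barrier. */
        if (rank == 0)
            usleep(500000);            /* 0.5 s of extra "work" on rank 0 */

        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }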

Barrier Completion
Time spent in the barrier after the first process has left the operation.
Applies to: MPI_Barrier

MPI collective communication time
MPI
  Synchronization
  Communication
    Point-to-point
    Collective
      Early Reduce
      Early Scan
      Late Broadcast
      Wait at N x N
      N x N Completion
    RMA

Wait at N x N
Time spent waiting in front of a synchronizing collective operation call until the last process reaches the operation.
Applies to: MPI_Allgather, MPI_Allgatherv, MPI_Allreduce, MPI_Alltoall, MPI_Reduce_scatter, MPI_Reduce_scatter_block
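
An illustrative sketch (again using usleep as a stand-in for computation) in which all ranks except the last one accumulate Wait at N x N time:

    #include <mpi.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        double local, global;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* The last rank reaches the synchronizing collective late; all other
           ranks wait inside MPI_Allreduce, which is classified as Wait at N x N. */
        if (rank == size - 1)
            usleep(500000);            /* extra "work" on the last rank */

        local = (double)rank;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }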

N x N Completion
Time spent in synchronizing collective operations after the first process has left the operation.
Applies to: MPI_Allgather, MPI_Allgatherv, MPI_Allreduce, MPI_Alltoall, MPI_Reduce_scatter, MPI_Reduce_scatter_block

Late Broadcast
Waiting time if the destination processes of a collective 1-to-N communication operation enter the operation earlier than the source process (root).
Applies to: MPI_Bcast, MPI_Scatter, MPI_Scatterv
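
A hypothetical sketch showing how a late root produces Late Broadcast time on the other ranks (usleep stands in for computation):

    #include <mpi.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* The root enters the 1-to-N operation last; the time the destination
           ranks spend blocked inside MPI_Bcast is classified as Late Broadcast. */
        if (rank == 0) {
            usleep(500000);            /* root still "computing" */
            value = 42;
        }
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }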

Early Reduce
Waiting time if the destination process (root) of a collective N-to-1 communication operation enters the operation earlier than its sending counterparts.
Applies to: MPI_Reduce, MPI_Gather, MPI_Gatherv
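
The mirror image of Late Broadcast, as a hypothetical sketch in which the root enters MPI_Reduce first and waits for the busy senders:

    #include <mpi.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, local, sum = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* The root calls the N-to-1 operation immediately while the senders
           are still busy; the root's waiting time is classified as Early Reduce. */
        if (rank != 0)
            usleep(500000);            /* non-root ranks still "computing" */

        local = rank;
        MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }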

Early Scan
Waiting time if process n enters a prefix reduction operation earlier than its sending counterparts (i.e., ranks 0..n-1).
Applies to: MPI_Scan, MPI_Exscan

MPI point-to-point communication time
MPI
  Synchronization
  Communication
    Point-to-point
      Late Sender
        Messages in Wrong Order
          Same Source
          Different Source
      Late Receiver
    Collective
    RMA

Late Sender
Waiting time caused by a blocking receive operation posted earlier than the corresponding send operation.
Applies to blocking as well as non-blocking communication (e.g., MPI_Recv or MPI_Irecv/MPI_Wait waiting for an MPI_Send or MPI_Isend issued later).
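
A minimal two-rank sketch (illustrative only, not from the slides) in which rank 1 accumulates Late Sender time in MPI_Recv:

    #include <mpi.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            /* The receive is posted immediately ... */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 0) {
            /* ... but the matching send is issued much later: the time rank 1
               blocks in MPI_Recv is classified as Late Sender. */
            usleep(500000);            /* sender still "computing" */
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }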

Late Sender (II)
While waiting for several messages, the maximum waiting time is accounted.
Applies to: MPI_Waitall, MPI_Waitsome
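
A hypothetical sketch for at least three ranks: rank 2 sends later than rank 1, so only the longer of the two waits determines the Late Sender time accounted in MPI_Waitall:

    #include <mpi.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, buf[2] = {0, 0}, msg;
        MPI_Request req[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Rank 0 waits for messages from ranks 1 and 2; only the longer
               of the two waiting times is accounted as Late Sender. */
            MPI_Irecv(&buf[0], 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req[0]);
            MPI_Irecv(&buf[1], 1, MPI_INT, 2, 0, MPI_COMM_WORLD, &req[1]);
            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        } else if (rank == 1 || rank == 2) {
            usleep(rank == 1 ? 200000 : 500000);   /* rank 2 sends latest */
            msg = rank;
            MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }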

Late Sender, Messages in Wrong Order
Refers to Late Sender situations which are caused by messages being received in the wrong order.
Comes in two flavours:
- Messages sent from the same source location
- Messages sent from different source locations
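
An illustrative two-rank sketch of the same-source flavour: the tag-1 message is already available, but the receiver insists on getting tag 2 first:

    #include <mpi.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, a = 1, b = 2, x, y;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Send(&a, 1, MPI_INT, 1, /* tag */ 1, MPI_COMM_WORLD);
            usleep(500000);                        /* stand-in for computation */
            MPI_Send(&b, 1, MPI_INT, 1, /* tag */ 2, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* The tag-1 message is already available, but rank 1 receives
               tag 2 first: the resulting Late Sender waiting time is
               attributed to Messages in Wrong Order (same source). */
            MPI_Recv(&y, 1, MPI_INT, 0, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&x, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }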

Late Receiver
Waiting time caused by a blocking send operation posted earlier than the corresponding receive operation.
Calculated by the receiver, but the waiting time is attributed to the sender.
Currently does not apply to non-blocking sends.
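
A hypothetical two-rank sketch; note that whether MPI_Send actually blocks depends on the implementation's eager/rendezvous threshold, so the large message size below is only an assumption:

    #include <mpi.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define N (1 << 22)   /* large message: many MPI implementations switch to a
                             rendezvous protocol, so MPI_Send blocks until the
                             matching receive is posted (implementation dependent) */

    int main(int argc, char **argv)
    {
        int rank;
        double *buf = calloc(N, sizeof(double));

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* The send is posted immediately; the time rank 0 blocks inside
               MPI_Send waiting for the late receive is Late Receiver time,
               attributed to the sender. */
            MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            usleep(500000);                    /* receiver still "computing" */
            MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }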

Late Sender / Late Receiver counts
The number of Late Sender and Late Receiver instances is also available. These counts are divided into communications and synchronizations and shown in the corresponding hierarchies.

MPI RMA synchronization time
MPI
  Synchronization
    Collective
    RMA
      Active Target
        Late Post
        Wait at Fence
          Early Fence
        Early Wait
          Late Complete
      Passive Target

Late Post
Waiting time in MPI_Win_start or MPI_Win_complete until the exposure epoch is opened by the target's MPI_Win_post. Which of the two calls blocks is implementation dependent.
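
A minimal sketch of general active target synchronization in which the target opens its exposure epoch late (illustrative only; assumes exactly two ranks):

    #include <mpi.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 0, value = 42, peer;
        MPI_Win win;
        MPI_Group world_grp, peer_grp;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_group(MPI_COMM_WORLD, &world_grp);
        peer = 1 - rank;                              /* assumes 2 ranks */
        MPI_Group_incl(world_grp, 1, &peer, &peer_grp);

        MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        if (rank == 0) {
            /* The access epoch is opened before the target has called
               MPI_Win_post; the blocking time in MPI_Win_start (or in
               MPI_Win_complete, depending on the MPI implementation) is
               classified as Late Post. */
            MPI_Win_start(peer_grp, 0, win);
            MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
            MPI_Win_complete(win);
        } else {
            usleep(500000);                  /* target still "computing" */
            MPI_Win_post(peer_grp, 0, win);  /* exposure epoch opened late */
            MPI_Win_wait(win);
        }

        MPI_Win_free(&win);
        MPI_Group_free(&peer_grp);
        MPI_Group_free(&world_grp);
        MPI_Finalize();
        return 0;
    }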

Wait at Fence
Time spent waiting in front of a synchronizing MPI_Win_fence call until the last process reaches the fence operation.
Only triggered if at least one of the following conditions applies:
- The given assertion is 0
- All fence calls overlap (heuristic)
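
An illustrative sketch in which rank 0 reaches the closing fence late while the other ranks wait (assertion 0, matching the first condition above; usleep stands in for computation):

    #include <mpi.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 0, value = 42;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                 /* open the epoch */

        if (rank == 0)
            usleep(500000);                    /* rank 0 reaches the next fence late */
        else if (rank == 1)
            MPI_Put(&value, 1, MPI_INT, 0, 0, 1, MPI_INT, win);

        /* The other ranks block here until rank 0 arrives; since the
           assertion argument is 0, this waiting time can be classified
           as Wait at Fence. */
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }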

Early Fence
Time spent waiting for the exit of the last RMA operation to the target location.
Sub-pattern of Wait at Fence.

Early Wait
Time spent in MPI_Win_wait until the access epoch is closed by the last MPI_Win_complete.

Late Complete
Waiting time due to an unnecessary pause between the last RMA operation to the target and closing the access epoch with the last MPI_Win_complete.
Sub-pattern of Early Wait.

MPI RMA communication time
MPI
  Synchronization
  Communication
    Point-to-point
    Collective
    RMA
      Early Transfer

Early Transfer
Time spent waiting in an RMA operation on the origin(s) that was started before the exposure epoch was opened on the target.

OpenMP-related metrics

OpenMP Time hierarchy
Time
  Execution
    MPI
    OMP
      Flush
      Management
        Fork
      Synchronization
  Overhead
  Idle Threads
    Limited parallelism

OpenMP Time hierarchy details
OMP: time spent for OpenMP-related tasks
Flush: time spent in OpenMP flush directives
Synchronization: time spent to synchronize OpenMP threads

OpenMP Management Time
Time spent on the master thread for creating/destroying OpenMP thread teams.

OpenMP Fork Time
Time spent on the master thread for creating OpenMP thread teams.
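
As an illustration (not from the slides), repeatedly opening a parallel region inside a hot loop inflates Management/Fork time; the loop bound and work() kernel below are arbitrary stand-ins:

    #include <stdio.h>

    /* Busy-work stand-in for real computation. */
    static double work(long n)
    {
        volatile double x = 0.0;
        for (long i = 0; i < n; ++i)
            x += 1.0 / (double)(i + 1);
        return x;
    }

    int main(void)
    {
        double r = 0.0;

        /* A thread team is forked (and joined) for every iteration of the
           outer loop; the master thread's share of this shows up as
           Management time and its Fork sub-metric. Hoisting the parallel
           region out of the loop would avoid most of that overhead. */
        for (int step = 0; step < 10000; ++step) {
            #pragma omp parallel reduction(+:r)
            r += work(1000L);
        }

        printf("%g\n", r);
        return 0;
    }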

OpenMP Idle Threads
Time spent idle on CPUs reserved for worker threads.

OpenMP Limited Parallelism
Time spent idle on worker threads within parallel regions.
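
A hypothetical sketch that would show both Idle Threads and Limited Parallelism when run with more than two threads; work() is an arbitrary busy-loop stand-in for computation:

    #include <stdio.h>

    /* Busy-work stand-in for real computation. */
    static double work(long n)
    {
        volatile double x = 0.0;
        for (long i = 0; i < n; ++i)
            x += 1.0 / (double)(i + 1);
        return x;
    }

    int main(void)
    {
        double r = 0.0;

        /* Serial phase: the CPUs reserved for the worker threads are idle,
           which is accounted as Idle Threads time. */
        r += work(100000000L);

        #pragma omp parallel
        {
            /* Only two loop iterations exist, so with more than two threads
               the remaining workers idle inside the parallel region, which
               is accounted as Limited Parallelism. */
            #pragma omp for reduction(+:r)
            for (int i = 0; i < 2; ++i)
                r += work(100000000L);
        }

        printf("%g\n", r);
        return 0;
    }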

OpenMP Synchronization Time hierarchy
OMP
  Flush
  Management
  Synchronization
    Barrier
      Explicit
      Implicit
    Critical
    Lock API
Time spent in OpenMP atomic constructs is attributed to the Critical metric.

OpenMP-related metrics (as produced by Scalasca's sequential trace analyzer for OpenMP and hybrid MPI/OpenMP applications)

OpenMP Time hierarchy
Time
  Execution
    MPI
    OpenMP
      Synchronization
      Fork
      Flush
  Idle Threads
  Overhead

OpenMP Time hierarchy details
OpenMP: time spent for OpenMP-related tasks
Synchronization: time spent for synchronizing OpenMP threads
Fork: time spent by the master thread to create thread teams
Flush: time spent in OpenMP flush directives
Idle Threads: time spent idle on CPUs reserved for worker threads

OpenMP synchronization time
OpenMP
  Synchronization
    Barrier
      Explicit
        Wait at Barrier
      Implicit
        Wait at Barrier
    Lock Competition
      API
      Critical

Wait at Barrier
Time spent waiting in front of a barrier until the last thread reaches the barrier operation.
Applies to: implicit and explicit OpenMP barriers
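
An illustrative sketch in which a statically scheduled, strongly imbalanced loop causes Wait at Barrier time at the loop's implicit barrier (work() is an arbitrary stand-in):

    #include <stdio.h>

    /* Busy-work stand-in for real computation. */
    static double work(long n)
    {
        volatile double x = 0.0;
        for (long i = 0; i < n; ++i)
            x += 1.0 / (double)(i + 1);
        return x;
    }

    int main(void)
    {
        double r = 0.0;

        #pragma omp parallel reduction(+:r)
        {
            /* Statically scheduled loop whose iterations get progressively
               more expensive: threads that draw the cheap iterations reach
               the implicit barrier at the end of the loop early and wait
               there (Wait at Barrier). */
            #pragma omp for schedule(static)
            for (int i = 0; i < 16; ++i)
                r += work(2000000L * (i + 1));
        }

        printf("%g\n", r);
        return 0;
    }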

Lock Competition
Time spent waiting for a lock that has been previously acquired by another thread.
Applies to: critical sections, OpenMP lock API
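
An illustrative sketch of lock competition on an OpenMP critical section (work() is an arbitrary stand-in for computation done inside and outside the lock):

    #include <stdio.h>

    /* Busy-work stand-in for real computation. */
    static double work(long n)
    {
        volatile double x = 0.0;
        for (long i = 0; i < n; ++i)
            x += 1.0 / (double)(i + 1);
        return x;
    }

    int main(void)
    {
        double total = 0.0;

        #pragma omp parallel
        {
            for (int i = 0; i < 8; ++i) {
                double r = work(1000000L);        /* work outside the lock */

                /* Every thread funnels through the same critical section; a
                   thread that finds it occupied accumulates Lock Competition
                   time. The same applies to omp_set_lock on a contended lock. */
                #pragma omp critical
                total += r + work(1000000L);      /* work while holding the lock */
            }
        }

        printf("%g\n", total);
        return 0;
    }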