Introduction to Parallel Performance Engineering

Markus Geimer, Brian Wylie
Jülich Supercomputing Centre
(with content used with permission from tutorials by Bernd Mohr/JSC and Luiz DeRose/Cray)

Performance: an old problem
[Figure: Difference Engine]
"The most constant difficulty in contriving the engine has arisen from the desire to reduce the time in which the calculations were executed to the shortest which is possible."
Charles Babbage (1791-1871)

Today: the free lunch is over
- Moore's law is still in charge, but
  - Clock rates no longer increase
  - Performance gains only through increased parallelism
- Optimizations of applications more difficult
  - Increasing application complexity
    - Multi-physics
    - Multi-scale
  - Increasing machine complexity
    - Hierarchical networks / memory
    - More CPUs / multi-core
- Every doubling of scale reveals a new bottleneck!

Example: XNS
- CFD simulation of unsteady flows
  - Developed by CATS / RWTH Aachen
  - Exploits finite-element techniques, unstructured 3D meshes, iterative solution strategies
- MPI parallel version
  - >40,000 lines of Fortran & C
  - DeBakey blood-pump data set (3,714,611 elements)
[Figures: partitioned finite-element mesh; hæmodynamic flow pressure distribution]

XNS wait-state analysis on BG/L (2007)

Performance factors of parallel applications
- Sequential factors
  - Computation: choose right algorithm, use optimizing compiler
  - Cache and memory: tough! Only limited tool support, hope compiler gets it right
  - Input / output: often not given enough attention
- Parallel factors
  - Partitioning / decomposition
  - Communication (i.e., message passing)
  - Multithreading
  - Synchronization / locking
  - More or less understood, good tool support

Tuning basics
- Successful engineering is a combination of
  - The right algorithms and libraries
  - Compiler flags and directives
  - Thinking!!!
- Measurement is better than guessing
  - To determine performance bottlenecks
  - To compare alternatives
  - To validate tuning decisions and optimizations
    - After each step!

However...
- "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
  Donald E. Knuth (often attributed to Charles A. R. Hoare)
- It's easier to optimize a slow correct program than to debug a fast incorrect one
  - Nobody cares how fast you can compute a wrong answer...

Performance engineering workflow
- Preparation: prepare application (with symbols), insert extra code (probes/hooks)
- Measurement: collection of data relevant to execution performance analysis
- Analysis: calculation of metrics, identification of performance problems
- Examination: presentation of results in an intuitive/understandable form
- Optimization: modifications intended to eliminate/reduce performance problems

The 80/20 rule
- Programs typically spend 80% of their time in 20% of the code
- Programmers typically spend 20% of their effort to get 80% of the total speedup possible for the application
  - Know when to stop!
- Don't optimize what does not matter
  - Make the common case fast!
- "If you optimize everything, you will always be unhappy."
  Donald E. Knuth

Metrics of performance
- What can be measured?
  - A count of how often an event occurs
    - E.g., the number of MPI point-to-point messages sent
  - The duration of some interval
    - E.g., the time spent in these send calls
  - The size of some parameter
    - E.g., the number of bytes transmitted by these calls
- Derived metrics
  - E.g., rates / throughput
  - Needed for normalization

Example metrics
- Execution time
- Number of function calls
- CPI: CPU cycles per instruction
- FLOPS: floating-point operations executed per second
  - What counts as an operation? "Math" operations? HW operations? HW instructions? 32-/64-bit?
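Both CPI and FLOPS are derived metrics: each divides one measured quantity by another. A minimal sketch, assuming the raw counts have already been read from hardware counters; the function names `cpi` and `flops` are hypothetical helpers, not part of any tool named in these slides:

```c
#include <assert.h>
#include <stdint.h>

/* Cycles per instruction: ratio of two hardware-counter values. */
double cpi(uint64_t cycles, uint64_t instructions)
{
    assert(instructions > 0);
    return (double)cycles / (double)instructions;
}

/* Floating-point operations per second: an operation count
 * normalized by the duration of the measured interval. */
double flops(uint64_t fp_ops, double seconds)
{
    assert(seconds > 0.0);
    return (double)fp_ops / seconds;
}
```

The result is only as meaningful as the counter behind `fp_ops`: whether it counts math operations, hardware operations, or hardware instructions has to be known before comparing numbers.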

Execution time
- Wall-clock time
  - Includes waiting time: I/O, memory, other system activities
  - In time-sharing environments also the time consumed by other applications
- CPU time
  - Time spent by the CPU to execute the application
  - Does not include time the program was context-switched out
    - Problem: Does not include inherent waiting time (e.g., I/O)
    - Problem: Portability? What is user, what is system time?
- Problem: Execution time is non-deterministic
  - Use mean or minimum of several runs
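The two kinds of time, and the advice to take the minimum of several runs, can be sketched with standard C11 facilities. This is an illustration under stated assumptions: `wall_seconds`, `cpu_seconds`, `min_wall_time`, and `busy_work` are hypothetical names, and `timespec_get` with `TIME_UTC` is used for portability even though it is not guaranteed to be monotonic.

```c
#include <assert.h>
#include <time.h>

/* Wall-clock time in seconds (C11 timespec_get; not guaranteed monotonic). */
double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* CPU time consumed by this process so far, in seconds. */
double cpu_seconds(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Run work() `runs` times and return the minimum wall-clock duration. */
double min_wall_time(void (*work)(void), int runs)
{
    assert(work != NULL && runs > 0);
    double best = 0.0;
    for (int r = 0; r < runs; r++) {
        double start = wall_seconds();
        work();
        double elapsed = wall_seconds() - start;
        if (r == 0 || elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Example workload: a short busy loop (volatile keeps it from being removed). */
void busy_work(void)
{
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; i++)
        x += i * 0.5;
}
```

Calling `min_wall_time(busy_work, 5)` gives the least-disturbed of five runs; comparing `cpu_seconds()` before and after the same call shows how much of the interval the process actually spent on the CPU.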

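Inclusive vs. Exclusive values
- Inclusive
  - Information of all sub-elements aggregated into single value
- Exclusive
  - Information cannot be subdivided further

    int foo()
    {
        int a;
        a = 1 + 1;

        bar();

        a = a + 1;
        return a;
    }

[Figure annotation: the inclusive value of foo covers the whole function body including the call to bar(); the exclusive value covers only foo's own statements, excluding bar()]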
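The relationship between the two values can be written down directly: the exclusive value of a region is its inclusive value minus the inclusive values of the regions it calls. A minimal sketch, assuming inclusive times have already been measured per call-tree node; the `struct region` layout is hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* One node of a call tree with its measured inclusive time. */
struct region {
    const char *name;
    double inclusive;               /* time in this region and everything it calls */
    const struct region *children;  /* directly called sub-regions */
    size_t num_children;
};

/* Exclusive time = inclusive time minus inclusive time of direct children. */
double exclusive_time(const struct region *r)
{
    assert(r != NULL);
    double t = r->inclusive;
    for (size_t i = 0; i < r->num_children; i++)
        t -= r->children[i].inclusive;
    return t;
}
```

For the foo/bar example above, if foo takes 10 time units inclusively and bar takes 3, foo's exclusive value is 7.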

Classification of measurement techniques
- How are performance measurements triggered?
  - Sampling
  - Code instrumentation
- How is performance data recorded?
  - Profiling / Runtime summarization
  - Tracing
- How is performance data analyzed?
  - Online
  - Post mortem

Sampling
[Figure: time line of main calling foo(0), foo(1), foo(2), with measurements taken at sample points t1 ... t9]

    void foo(int i);

    int main()
    {
        int i;

        for (i=0; i < 3; i++)
            foo(i);

        return 0;
    }

    void foo(int i)
    {
        if (i > 0)
            foo(i - 1);
    }

- Running program is periodically interrupted to take measurement
  - Timer interrupt, OS signal, or HWC overflow
  - Service routine examines return-address stack
  - Addresses are mapped to routines using symbol table information
- Statistical inference of program behavior
  - Not very detailed information on highly volatile metrics
  - Requires long-running applications
- Works with unmodified executables
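The address-to-routine mapping step can be sketched as a lookup in a sorted symbol table. This is an illustration only: the table layout and the names `find_symbol` and `record_sample` are assumptions, and the sketch attributes just the interrupted program counter rather than walking the whole return-address stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Symbol-table entry: a routine occupies addresses [start, start + size). */
struct symbol {
    const char *name;
    uintptr_t start;
    size_t size;
    unsigned long samples;   /* number of samples attributed to this routine */
};

/* Binary search over a table sorted by start address; returns index or -1. */
int find_symbol(const struct symbol *table, int n, uintptr_t pc)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (pc < table[mid].start)
            hi = mid - 1;
        else if (pc >= table[mid].start + table[mid].size)
            lo = mid + 1;
        else
            return mid;
    }
    return -1;
}

/* Called once per sample: attribute the interrupted program counter. */
void record_sample(struct symbol *table, int n, uintptr_t pc)
{
    assert(table != NULL && n > 0);
    int idx = find_symbol(table, n, pc);
    if (idx >= 0)
        table[idx].samples++;
}
```

After many samples, the per-routine counts approximate the share of run time spent in each routine, which is why short runs give unreliable results.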

Instrumentation
[Figure: time line of main calling foo(0), foo(1), foo(2), with measurements taken at every region enter and leave, t1 ... t14]

    void foo(int i);

    int main()
    {
        int i;
        Enter("main");
        for (i=0; i < 3; i++)
            foo(i);
        Leave("main");
        return 0;
    }

    void foo(int i)
    {
        Enter("foo");
        if (i > 0)
            foo(i - 1);
        Leave("foo");
    }

- Measurement code is inserted such that every event of interest is captured directly
  - Can be done in various ways
- Advantage:
  - Much more detailed information
- Disadvantage:
  - Processing of source-code / executable necessary
  - Large relative overheads for small functions
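What the `Enter` and `Leave` probes might do internally can be sketched with a small call stack of timestamps. This is one hypothetical implementation, not the one used by any particular tool; the fixed table sizes, the `region_visits` accessor, and the use of `timespec_get` are assumptions. With recursion, the inclusive time of a region counts nested invocations more than once.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

#define MAX_REGIONS 16
#define MAX_DEPTH   64

struct region_stats {
    const char *name;
    unsigned long visits;
    double inclusive;   /* accumulated seconds between Enter and Leave */
};

static struct region_stats regions[MAX_REGIONS];
static int num_regions;
static struct { int region; double start; } call_stack[MAX_DEPTH];
static int depth;

static double now(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Look up a region by name, registering it on first use. */
static int region_index(const char *name)
{
    for (int i = 0; i < num_regions; i++)
        if (strcmp(regions[i].name, name) == 0)
            return i;
    assert(num_regions < MAX_REGIONS);
    regions[num_regions].name = name;
    return num_regions++;
}

void Enter(const char *name)
{
    assert(depth < MAX_DEPTH);
    call_stack[depth].region = region_index(name);
    call_stack[depth].start = now();
    depth++;
}

void Leave(const char *name)
{
    assert(depth > 0);
    depth--;
    int r = call_stack[depth].region;
    assert(strcmp(regions[r].name, name) == 0);  /* Enter/Leave must nest */
    regions[r].visits++;
    regions[r].inclusive += now() - call_stack[depth].start;
}

/* Number of completed visits recorded for a region. */
unsigned long region_visits(const char *name)
{
    return regions[region_index(name)].visits;
}
```

Every probe costs a name lookup and a timer call, which illustrates why the relative overhead is large for small functions.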

Instrumentation techniques
- Static instrumentation
  - Program is instrumented prior to execution
- Dynamic instrumentation
  - Program is instrumented at runtime
- Code is inserted
  - Manually
  - Automatically
    - By a preprocessor / source-to-source translation tool
    - By a compiler
    - By linking against a pre-instrumented library / runtime system
    - By binary-rewrite / dynamic instrumentation tool

Critical issues
- Accuracy
  - Intrusion overhead
    - Measurement itself needs time and thus lowers performance
  - Perturbation
    - Measurement alters program behaviour
    - E.g., memory access pattern
  - Accuracy of timers & counters
- Granularity
  - How many measurements?
  - How much information / processing during each measurement?
- Tradeoff: Accuracy vs. Expressiveness of data


Profiling / Runtime summarization
- Recording of aggregated information
  - Total, maximum, minimum, ...
- For measurements
  - Time
  - Counts
    - Function calls
    - Bytes transferred
    - Hardware counters
- Over program and system entities
  - Functions, call sites, basic blocks, loops, ...
  - Processes, threads, ...
- Profile = summarization of events over execution interval

Types of profiles
- Flat profile
  - Shows distribution of metrics per routine / instrumented region
  - Calling context is not taken into account
- Call-path profile
  - Shows distribution of metrics per executed call path
  - Sometimes only distinguished by partial calling context (e.g., two levels)
- Special-purpose profiles
  - Focus on specific aspects, e.g., MPI calls or OpenMP constructs
  - Comparing processes/threads

Tracing
- Recording information about significant points (events) during execution of the program
  - Enter / leave of a region (function, loop, ...)
  - Send / receive a message, ...
- Save information in event record
  - Timestamp, location, event type
  - Plus event-specific information (e.g., communicator, sender / receiver, ...)
- Abstract execution model on level of defined events
- Event trace = Chronologically ordered sequence of event records

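Event tracing
[Figure: two instrumented processes, each with a MONITOR that writes a local trace; the monitors are synchronize(d), and the local traces are merged and unified into a global trace view]

Process A (instrumented):

    void foo() {
        trc_enter("foo");
        ...
        trc_send(b);
        send(b, tag, buf);
        ...
        trc_exit("foo");
    }

Process B (instrumented):

    void bar() {
        trc_enter("bar");
        ...
        recv(a, tag, buf);
        trc_recv(a);
        ...
        trc_exit("bar");
    }

Local trace A (region table: 1 = foo):

    ...
    58 ENTER 1
    62 SEND B
    64 EXIT 1
    ...

Local trace B (region table: 1 = bar):

    ...
    60 ENTER 1
    68 RECV A
    69 EXIT 1
    ...

Global trace view after merge and unify (region table: 1 = foo, 2 = bar):

    ...
    58 A ENTER 1
    60 B ENTER 2
    62 A SEND B
    64 A EXIT 1
    68 B RECV A
    69 B EXIT 2
    ...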
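The merge and unify steps in the figure can be sketched as a two-way merge by timestamp, with a per-process table that translates local region ids into global ones. This is an illustration under stated assumptions: the `struct event` layout and `merge_traces` are hypothetical, and timestamps are taken to be already synchronized across processes.

```c
#include <assert.h>
#include <stddef.h>

enum event_type { ENTER, EXIT, SEND, RECV };

struct event {
    long time;            /* timestamp, assumed already synchronized */
    char location;        /* process label, e.g. 'A' or 'B' */
    enum event_type type;
    int data;             /* region id for ENTER/EXIT, peer label for SEND/RECV */
};

/* Merge two time-sorted local traces into one global trace.
 * map_a/map_b translate local region ids to unified global ids.
 * Returns the number of events written to out (which must hold na + nb). */
size_t merge_traces(const struct event *a, size_t na, const int *map_a,
                    const struct event *b, size_t nb, const int *map_b,
                    struct event *out)
{
    assert(out != NULL);
    size_t i = 0, j = 0, k = 0;
    while (i < na || j < nb) {
        /* Take from a if b is exhausted or a's next event is not later. */
        int take_a = (j >= nb) || (i < na && a[i].time <= b[j].time);
        struct event e = take_a ? a[i++] : b[j++];
        if (e.type == ENTER || e.type == EXIT)
            e.data = (take_a ? map_a : map_b)[e.data];
        out[k++] = e;
    }
    return k;
}
```

With the two local traces shown above and the mappings {1 -> 1} for A and {1 -> 2} for B, the output matches the global trace view: bar's local id 1 becomes global id 2.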

Tracing vs. Profiling
- Tracing advantages
  - Event traces preserve the temporal and spatial relationships among individual events (context)
  - Allows reconstruction of dynamic application behaviour on any required level of abstraction
  - Most general measurement technique
    - Profile data can be reconstructed from event traces
- Disadvantages
  - Traces can very quickly become extremely large
  - Writing events to file at runtime causes perturbation
  - Writing tracing software is complicated
    - Event buffering, clock synchronization, ...
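The claim that profile data can be reconstructed from event traces can be illustrated by replaying the enter/exit events of one location. A minimal sketch, assuming properly nested events and small integer region ids; the types and `profile_from_trace` are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

enum event_type { ENTER, EXIT };

struct event {
    long time;
    enum event_type type;
    int region;
};

struct profile_entry {
    unsigned long visits;
    long inclusive;   /* summed EXIT - ENTER timestamps */
    long exclusive;   /* inclusive minus time spent in nested regions */
};

#define MAX_DEPTH 64

/* Replay one location's events and accumulate a flat profile.
 * profile must have one zero-initialized entry per region id. */
void profile_from_trace(const struct event *trace, size_t n,
                        struct profile_entry *profile)
{
    long enter_time[MAX_DEPTH];
    long child_time[MAX_DEPTH];   /* time spent in callees of each open region */
    int depth = 0;

    for (size_t i = 0; i < n; i++) {
        if (trace[i].type == ENTER) {
            assert(depth < MAX_DEPTH);
            enter_time[depth] = trace[i].time;
            child_time[depth] = 0;
            depth++;
        } else {
            assert(depth > 0);
            depth--;
            long incl = trace[i].time - enter_time[depth];
            struct profile_entry *p = &profile[trace[i].region];
            p->visits++;
            p->inclusive += incl;
            p->exclusive += incl - child_time[depth];
            if (depth > 0)
                child_time[depth - 1] += incl;   /* charge caller's child time */
        }
    }
}
```

The reverse direction is not possible: a profile has discarded the ordering and timing of individual events, so a trace cannot be recovered from it.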


Online analysis
- Performance data is processed during measurement run
  - Process-local profile aggregation
  - More sophisticated inter-process analysis using
    - Piggyback messages
    - Hierarchical network of analysis agents
- Inter-process analysis often involves application steering to interrupt and re-configure the measurement

Post-mortem analysis
- Performance data is stored at end of measurement run
- Data analysis is performed afterwards
  - Automatic search for bottlenecks
  - Visual trace analysis
  - Calculation of statistics

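Example: Time-line visualization
[Figure: time-line display of processes A and B over timestamps 58 to 70, with regions main, foo (1), and bar (2) and the message from A to B, derived from the global trace below]

    ...
    58 A ENTER 1
    60 B ENTER 2
    62 A SEND B
    64 A EXIT 1
    68 B RECV A
    69 B EXIT 2
    ...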

No single solution is sufficient!
A combination of different methods, tools and techniques is typically needed!
- Analysis
  - Statistics, visualization, automatic analysis, data mining, ...
- Measurement
  - Sampling / instrumentation, profiling / tracing, ...
- Instrumentation
  - Source code / binary, manual / automatic, ...

Typical performance analysis procedure
- Do I have a performance problem at all?
  - Time / speedup / scalability measurements
- What is the key bottleneck (computation / communication)?
  - MPI / OpenMP / flat profiling
- Where is the key bottleneck?
  - Call-path profiling, detailed basic block profiling
- Why is it there?
  - Hardware counter analysis, trace selected parts to keep trace size manageable
- Does the code have scalability problems?
  - Load imbalance analysis, compare profiles at various sizes function-by-function
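The first and last questions of this procedure rest on a few simple ratios. A minimal sketch, assuming run times have already been measured; the function names are hypothetical, and the imbalance measure used here (maximum over mean per-process time) is one common choice rather than a definition taken from these slides:

```c
#include <assert.h>

/* Speedup of a parallel run relative to a reference (e.g., serial) run. */
double speedup(double t_ref, double t_parallel)
{
    assert(t_ref > 0.0 && t_parallel > 0.0);
    return t_ref / t_parallel;
}

/* Parallel efficiency: speedup divided by the number of processes. */
double efficiency(double t_ref, double t_parallel, int num_procs)
{
    assert(num_procs > 0);
    return speedup(t_ref, t_parallel) / (double)num_procs;
}

/* Load imbalance as max / mean of per-process times; 1.0 means balanced. */
double imbalance(const double *t, int n)
{
    assert(t != 0 && n > 0);
    double total = 0.0, max = t[0];
    for (int i = 0; i < n; i++) {
        total += t[i];
        if (t[i] > max)
            max = t[i];
    }
    assert(total > 0.0);
    return max * (double)n / total;
}
```

Computing these at several process counts shows whether efficiency drops as the run is scaled up, and the imbalance value hints at whether uneven work distribution is a likely cause.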