Performance analysis and tuning of parallel/distributed applications
Anna Morajko, Anna.Morajko@uab.es
26.05.2008

Introduction
Main research projects: develop techniques and tools for application performance improvement.
Monitoring, analysis, and tuning tools:
- Kappa-Pi: post-mortem analysis
- DMA: automatic analysis at run time
- MATE: automatic tuning at run time
Performance models:
- for applications based on frameworks
- for scientific libraries

Performance analysis and tuning: post-mortem manual analysis
[Diagram: the user develops the application source; during execution, monitoring tools collect events and metrics into trace/profile data; visualization and performance analysis reveal problems and solutions; the user then applies modifications and tunes the application.]

Performance analysis and tuning: post-mortem automatic analysis
[Diagram: monitoring tools collect events and metrics into trace/profile data during execution; Kappa-Pi performs the performance analysis and returns suggestions to the user, who modifies and tunes the application source.]

Performance analysis and tuning: run-time analysis
[Diagram: the running application is instrumented and monitored; DMA (Dynamic Monitor and Analyzer) analyzes the performance data and metrics on the fly and reports problems and their root causes; the user applies modifications and tuning to the source.]

Performance analysis and tuning: dynamic tuning
[Diagram: the running application is instrumented and monitored; MATE (Monitoring, Analysis and Tuning Environment) analyzes the events and performance data, identifies problems and solutions, and applies the modifications directly to the executing application.]

Performance analysis and tuning: what can be tuned in an application?
Layers, from top to bottom: application code, framework code (via its API), libraries code (via their API), operating system kernel (via the OS API), hardware.

Performance analysis and tuning: knowledge representation
- Measure points: where the instrumentation must be inserted to provide measurements.
- Performance model: determines the minimal execution time of the entire application; its formulas and conditions for optimal behavior yield the optimal values.
- Tuning points/actions/synchronization: what and when can be changed in the application.
  - point: element that may be changed
  - action: what operation to invoke on a point
  - synchronization: when a tuning action can be invoked to ensure application correctness
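To make the measure-point / model / tuning-action triple concrete, here is a minimal C++ sketch of how such knowledge could be represented; all type and member names are hypothetical illustrations, not MATE's actual API.

```cpp
// Minimal sketch of the tuning-knowledge triple described above.
// All type and member names are hypothetical, not MATE's real API.
#include <functional>
#include <string>
#include <vector>

struct MeasurePoint {              // where to insert instrumentation
    std::string function;          // e.g. entry of the worker compute function
    std::string event;             // metric or event emitted at this point
};

struct TuningAction {              // what to change, and how
    std::string point;             // element that may be changed (e.g. a variable)
    std::string action;            // operation to invoke (e.g. "set value")
    std::string synchronization;   // when it is safe to apply (e.g. "between iterations")
};

struct PerformanceModel {          // formulas/conditions for optimal behavior
    // Maps the collected measurements to the optimal values of the tuning points.
    std::function<std::vector<double>(const std::vector<double>&)> optimalValues;
};

struct TuningKnowledge {
    std::vector<MeasurePoint> measurePoints;
    PerformanceModel model;
    std::vector<TuningAction> actions;
};
```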

Performance analysis and tuning
[Diagram: the knowledge (measure points, performance model, tuning points/actions/sync) is defined over the software stack (application code, framework code, libraries code, OS kernel, hardware); during execution MATE uses it for monitoring, performance analysis, and tuning.]

Outline
- Post-mortem analysis
- Online modeling and analysis
- Performance models: framework-based applications, scientific libraries
- MATE
- Future work

Post-mortem analysis: the Kappa-Pi tool
Kappa-Pi: Knowledge-Based Automatic Parallel Program Analyser for Performance Improvement.
- Trace analyzer: automatically detects bottlenecks, relates them to the source code, and provides suggestions to the user.
- For PVM/MPI applications.
- Rule-based inference engine: detection based on decision trees.
- Declarative catalog of communication problems (XML); problems are declared as structured event patterns.
- Focused on the analysis of inefficiency intervals.
- Basic source-code analysis.

Outline
- Post-mortem analysis
- Online modeling and analysis
- Performance models: framework-based applications, scientific libraries
- MATE
- Future work

Online modeling and analysis: DMA, the Dynamic Modeler and Analyzer
Primary objective: develop a tool that is able to analyze the performance of parallel applications, detect bottlenecks, and explain their reasons.
Our approach:
- Dynamic, on-the-fly analysis (DynInst library)
- Online performance modeling that captures the application structure and behavior
- Root-cause performance analysis
- Tool targeted at MPI programs and communication problems

Online modeling and analysis: Task Activity Graph (TAG)
- Abstracts the execution of a single task; execution is described by units that correspond to different activities.
- Nodes reflect the execution of communication activities and selected loops.
- Edges represent the sequential flow of execution (computation activities).
- The behavior of nodes and edges is described by execution profiles; profiles contain aggregated performance metrics.
- The TAG maintains the happens-before relationship between nodes and edges.
- The model is constructed incrementally at run time in the memory of each task.
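As an illustration of the structure described above, the following C++ sketch shows what a per-task TAG with profiled nodes and edges could look like; the types and fields are assumptions made for illustration, not DMA's real data structures.

```cpp
// Minimal sketch of a per-task TAG; names are illustrative only.
#include <cstdint>
#include <map>
#include <vector>

struct Profile {                  // aggregated metrics attached to a node or edge
    uint64_t count = 0;           // number of executions observed
    double   totalTime = 0.0;     // accumulated time in seconds
    uint64_t bytes = 0;           // e.g. message volume for communication nodes
};

struct Node {                     // a communication activity or selected loop
    uint64_t callPathId;          // identifies the activity (e.g. an MPI call site)
    Profile  profile;
    std::vector<int> successors;  // indices of edges leaving this node (happens-before order)
};

struct Edge {                     // sequential computation between two activities
    int from, to;                 // node indices
    Profile profile;
};

struct TAG {                      // built incrementally in the memory of each task
    std::vector<Node> nodes;
    std::vector<Edge> edges;
    std::map<uint64_t, int> byCallPath;  // call-path id -> node index for fast lookup
};
```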

Online modeling and analysis: parallel model (PTAG)
- Individual TAG models connected by message edges (point-to-point, collective) enable the construction of the Parallel TAG (PTAG).
- This can be done at run time: the sender call-path id is sent with every MPI message.
- The PTAG is updated periodically by sampling and merging the TAGs.

Online modeling and analysis: how to find problems and their causes?
Root Cause Analysis (RCA):
- An online approach to performance analysis.
- Aims to find bottlenecks and their causes; focuses on finding the causes of latencies.
- Based on the TAG/PTAG model.
- RCA is a continuous analysis process divided into three phases:
  - Phase 1: identify performance problems
  - Phase 2: analyze individual problems
  - Phase 3: correlate individual problems and find their root causes

Online modeling and analysis, phase 1: performance problem identification
- A bottleneck is defined as an individual task activity that has accumulated a significant amount of execution time.
- In the TAG model a single bottleneck is identified by a node or an edge.
- The bottleneck can be a symptom of another problem or a root cause.
- Identification (local/global): capture a TAG/PTAG snapshot, rank the nodes and edges, and select the top-k candidates.
- This is a periodic process over a moving time window.
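The ranking and top-k selection step can be pictured with the short C++ sketch below; it assumes the accumulated-time criterion named on the slide and a hypothetical candidate record, and the actual DMA ranking may combine further metrics.

```cpp
// Illustrative top-k selection over a TAG/PTAG snapshot.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Candidate {
    int    nodeIndex;         // node or edge identified in the snapshot
    double accumulatedTime;   // time accumulated within the current time window
};

// Rank activities by accumulated time and keep the k most expensive ones.
std::vector<Candidate> topK(std::vector<Candidate> snapshot, std::size_t k) {
    std::sort(snapshot.begin(), snapshot.end(),
              [](const Candidate& a, const Candidate& b) {
                  return a.accumulatedTime > b.accumulatedTime;
              });
    if (snapshot.size() > k)
        snapshot.resize(k);
    return snapshot;
}
```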

Online modeling and analysis, phase 2: analyze individual problems
- For each problem detected in phase 1, we investigate the possible higher-level causes by exploring a knowledge-based cause space.
- We focus on determining the causes that contribute most to the total problem time.
- For each activity we define a performance model that quantifies its cost.
- A task-activity classification is used for better problem understanding.

Online modeling and analysis, phase 3: causality paths
- How to explain the causes of latencies? For example: why is the sender late?
- The cause of the waiting time between two nodes is found in the differences between their execution paths.
- We can track the activities that happen before the problem occurs on both nodes and compare them.
- We call them causality paths, as they have the happens-before property.
- Two sources of causality are considered: sequential flow and message flow.
- Causality paths are detected in the sender to explain latencies in the receiver.

Outline
- Post-mortem analysis
- Online modeling and analysis
- Performance models: framework-based applications, scientific libraries
- MATE
- Future work

Performance models: framework-based applications
- Frameworks help users to develop parallel applications by providing APIs that hide low-level details (master/worker, pipeline, divide and conquer).
- We can extract the knowledge given by frameworks to define performance models for automatic and dynamic performance tuning (MATE).
- Examples: M/W framework (UAB), eSkel (the Edinburgh Skeleton Library), llc (La Laguna Compiler).
[Diagram: measure points, performance model, and tuning points/actions/sync are attached to the framework-code layer of the software stack.]

Performance models: framework-based application example, an M/W application
[Diagram: framework classes (CommGen, CommMaster, Worker, input/output data structures) and user classes (MyMaster with Entry/Exit, MyWorker with CompFunc()); measure points are placed at the entry, exit, and iteration points.]
- Problems for tuning: load imbalance, improper number of workers, message size.
- Measure points: entry, exit, iteration.
- Performance model: Nopt = (λ·V + Tc) / tl (sketched below).
- Tuning point/action/sync: change the variable Nopt.
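A hedged sketch of evaluating the slide's worker-count model follows. Reading the formula as Nopt = (λ·V + Tc) / tl, and the symbol meanings (λ per-unit communication cost, V data volume, Tc total computation time, tl per-worker overhead), are assumptions on our part, since the slide does not define them; MATE's actual M/W model may differ.

```cpp
// Hedged sketch: evaluate Nopt = (λ·V + Tc) / tl from measurements collected
// at the iteration measure points.  Symbol meanings are assumptions (see text).
#include <algorithm>
#include <cmath>
#include <cstdio>

int optimalWorkers(double lambda, double V, double Tc, double tl) {
    double nopt = (lambda * V + Tc) / tl;
    // A worker count must be a positive integer.
    return std::max(1, static_cast<int>(std::lround(nopt)));
}

int main() {
    // Example measurement values (made up for illustration).
    std::printf("Nopt = %d\n", optimalWorkers(1e-6, 8e6, 2.0, 0.25));
    return 0;
}
```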

Performance models: scientific libraries modeling
- Libraries help users to solve domain-specific problems.
- We use the knowledge learned from scientific programming libraries to define performance models for automatic and dynamic performance tuning (MATE).
- Example: the mathematical library PETSc (Portable, Extensible Toolkit for Scientific Computation).
[Diagram: measure points, performance model, and tuning points/actions/sync are attached to the libraries-code layer of the software stack.]

Performance models: scientific libraries modeling, PETSc
Study of different combinations of matrix types, preconditioners, and solvers.
[Diagram: PETSc levels of abstraction, from the application code down through the higher-level solvers (PDE solvers, TS time stepping, SNES nonlinear solvers), the SLES linear solvers (KSP and PC), the data structures (matrices, vectors, index sets, draw), to BLAS, LAPACK, and MPI.]

Performance models: scientific libraries modeling, PETSc
- PETSc supports different storage methods (dense, sparse, compressed, etc.); no single data structure is appropriate for all problems.
- Linear-system calculations involve the matrix's preconditioner (PC) and a Krylov subspace method (KSP).
- Different mathematical algorithms are available for the preconditioner (Jacobi, LU, etc.) and the KSP (GMRES, CG, etc.).
- Storage-method and algorithm selection can significantly affect performance (see the configuration sketch below).
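For readers unfamiliar with PETSc, the sketch below shows where these choices appear in code (KSP type, PC type, and run-time option overrides); matrix and vector setup and error checking are omitted, and the call signatures shown are those of recent PETSc releases (older releases take an extra MatStructure argument in KSPSetOperators and a non-pointer argument in KSPDestroy).

```cpp
// Minimal PETSc sketch of the tunable choices named above: Krylov method,
// preconditioner, and (at matrix creation time) the storage format.
#include <petscksp.h>

void solveWithChosenConfig(Mat A, Vec b, Vec x) {
    KSP ksp;
    PC  pc;

    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);       // attach the system matrix
    KSPSetType(ksp, KSPGMRES);        // Krylov method: GMRES, CG, ...
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCJACOBI);          // preconditioner: Jacobi, ILU, LU, ...
    KSPSetFromOptions(ksp);           // allow -ksp_type / -pc_type overrides at run time

    // The storage format (MATAIJ, MATDENSE, ...) is chosen when A is created,
    // e.g. with MatSetType(A, MATAIJ).
    KSPSolve(ksp, b, x);
    KSPDestroy(&ksp);
}
```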

Performance models: scientific libraries modeling, PETSc
- We build a knowledge base that maps problems (different classes of matrices, matrix sizes, and other variables) to their best configurations.
- An input matrix is compared to the knowledge base using the k-nearest-neighbors algorithm to select the most similar problem and its configuration.
- Load balancing by assigning different numbers of rows/values to the processes.
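A minimal illustration of such a nearest-neighbor lookup is sketched below; the feature set (size, nonzeros, symmetry measure) and the configuration fields are hypothetical stand-ins for the real knowledge-base schema.

```cpp
// Illustrative k-nearest-neighbor lookup over the knowledge base described above
// (k = 1 for brevity).  Features and configuration fields are hypothetical.
#include <algorithm>
#include <cmath>
#include <string>
#include <vector>

struct ProblemFeatures { double n, nnz, symmetry; };   // simple numeric features
struct Configuration   { std::string matType, kspType, pcType; };
struct KnowledgeEntry  { ProblemFeatures f; Configuration best; };

static double dist(const ProblemFeatures& a, const ProblemFeatures& b) {
    return std::hypot(std::hypot(a.n - b.n, a.nnz - b.nnz), a.symmetry - b.symmetry);
}

// Pick the configuration of the most similar known problem.
Configuration selectConfig(const std::vector<KnowledgeEntry>& kb,
                           const ProblemFeatures& input) {
    auto nearest = std::min_element(kb.begin(), kb.end(),
        [&](const KnowledgeEntry& a, const KnowledgeEntry& b) {
            return dist(a.f, input) < dist(b.f, input);
        });
    return nearest->best;   // assumes the knowledge base is non-empty
}
```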

Outline
- Post-mortem analysis
- Online modeling and analysis
- Performance models: framework-based applications, scientific libraries
- MATE
- Future work

MATE
[Diagram: performance models at the different levels of the software stack: models for specific applications (application code), models for framework-based applications (framework code), models for scientific libraries (libraries code), and models for any application (OS/hardware level); each defines measure points, a performance model, and tuning points/actions/sync that MATE uses during execution for monitoring, performance analysis, and tuning.]

MATE: tunlets
- A tunlet is the mechanism for specifying a performance model: its measure points, performance model, and tuning points.
- Originally, each tunlet is written as a C++ code library plugged into the Analyzer.
- To simplify tunlet development: a Tunlet Specification Language and an Automatic Tunlet Generator.
[Diagram: a tunlet specification lists measure points p1, p2, ..., pi, performance functions f1, ..., fj, and tuning actions/points tp1, ..., tph; the generated tunlets for the application, framework, or scientific libraries are loaded into MATE together with the user's own tunlets.]
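The sketch below suggests what a tunlet written as a C++ library might expose to the Analyzer; the interface is hypothetical and does not reproduce MATE's real tunlet API.

```cpp
// Hypothetical tunlet interface: measure points, performance functions,
// and tuning actions, as named in the tunlet specification above.
#include <string>
#include <vector>

struct Event { std::string measurePoint; double value; };

class Tunlet {
public:
    virtual ~Tunlet() = default;
    // Measure points p1..pi the Analyzer must instrument for this model.
    virtual std::vector<std::string> measurePoints() const = 0;
    // Performance functions f1..fj: evaluate the model on the collected events.
    virtual void processEvents(const std::vector<Event>& events) = 0;
    // Tuning actions/points tp1..tph: the changes to apply, if any.
    virtual std::vector<std::string> tuningActions() const = 0;
};

// Example: a tunlet for the M/W worker-count model sketched earlier would
// measure iteration entry/exit, evaluate Nopt, and request a change of the
// worker-count variable whenever Nopt differs from the current value.
```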

MATE: scalability
- Scalability problems with centralized analysis: the volume of events gets too high.
[Diagram: machines 1..N each run a pvmd daemon, application tasks linked with the DMLib monitoring library, and an AC; events from all machines flow to a single Analyzer machine, which sends tuning actions back.]

MATE: scalability
Distributed hierarchical collecting/preprocessing approach:
- Distribution of event collection: Collector-Preprocessors (CPs) preprocess certain calculations, namely cumulative and comparative operations.
- The Global Analyzer evaluates the performance model.
[Diagram: tasks with DMLib and an AC on machines 1..n send events to Collector-Preprocessors on intermediate machines, which forward aggregated data to the Global Analyzer machine running the tunlets; tuning actions flow back to the ACs.]
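The kind of cumulative preprocessing a Collector-Preprocessor could perform before forwarding data is sketched below; the record layout and member names are assumptions for illustration.

```cpp
// Illustrative per-measure-point aggregation in a Collector-Preprocessor,
// so that summaries rather than raw events reach the Global Analyzer.
#include <map>
#include <string>

struct Aggregate { long count = 0; double totalTime = 0.0; double maxTime = 0.0; };

class CollectorPreprocessor {
    std::map<std::string, Aggregate> perMeasurePoint;   // one aggregate per measure point
public:
    // Called for every raw event received from the local ACs.
    void onEvent(const std::string& measurePoint, double elapsed) {
        Aggregate& a = perMeasurePoint[measurePoint];
        a.count += 1;
        a.totalTime += elapsed;
        if (elapsed > a.maxTime) a.maxTime = elapsed;
    }
    // Periodically flushed to the Global Analyzer instead of the raw events.
    const std::map<std::string, Aggregate>& summary() const { return perMeasurePoint; }
};
```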

MATE: MATE in Grid systems (G-MATE)
- Grid performance models.
- Tool distribution architecture.
- Application monitoring: transparent process tracking and control; the AC should follow the application process to any cluster.
- Inter-cluster communication with the analyzer: lower the inter-cluster event-collection overhead; since inter-cluster links generally have high latency and lower bandwidth, remote trace events should be aggregated.
- Analysis in the Grid.
[Diagram: a job submitted through Condor/LSF/PBS runs across clusters A, B, and C, each behind its own head node.]

MATE: G-MATE, transparent process tracking
- Application wrapping: the AC can be packaged with the application binary.
[Diagram: at job submission an AC is created on the remote machine together with the task (linked with DMLib); when Dyninst detects that a task creates a new task, a new AC is created on the corresponding remote machine to control it, and that AC subscribes to the Analyzer.]

MATE: G-MATE, analyzer approaches
- Central analyzer: collects performance data at a central site.
- Hierarchical analyzer: local analyzers process local data, generate abstract metrics, and send them to global analyzers.
- Distributed analyzer: each analyzer evaluates its local performance data and abstracts global data by cooperating with the rest of the analyzers.

MATE: tuning scenario
- Distributed Grid application: hierarchical master/worker matrix multiplication.
- Performance model for: number of workers, application grain size.
- Common data are transmitted once; the sub-master exploits data locality.
- Changing the compute/communication ratio changes the maximum number of workers.
[Diagram: the master in cluster A drives its own workers and a sub-master in cluster B, which manages further workers; the Analyzer runs a tunlet containing the performance model, measure points, and tuning points.]

Outline
- Post-mortem analysis
- Online modeling and analysis
- Performance models: framework-based applications, scientific libraries
- MATE
- Future work

Future work
- DMA: the new methodology is in the final phase of development.
- Development of performance models:
  - for applications based on different frameworks: joining both strategies (load balance and number of workers); replication/elimination of pipeline stages
  - for PETSc-based applications (under construction)
  - for black-box applications
  - for grid applications (under construction)
- Development of tuning techniques.

Future work
Improvement of MATE's analysis:
- performance model of the number of CPs
- hierarchy at the CP level as the number of machines increases (MRNet?)
- root-cause analysis?
- instrumentation evaluation
- co-execution of tunlets

Thank you very much
Anna Morajko, Anna.Morajko@uab.es
26.05.2008