Probabilistic Diagnosis of Performance Faults in Large-Scale Parallel Applications

International Conference on Parallel Architectures and Compilation Techniques (PACT), Minneapolis, MN, Sep 21st, 2012
Probabilistic Diagnosis of Performance Faults in Large-Scale Parallel Applications
Ignacio Laguna, Saurabh Bagchi, Dong H. Ahn, Bronis R. de Supinski, Todd Gamblin
Slide 1/24

Developing Resilient HPC Applications is Challenging
MTTF of hours in future exascale supercomputers.
Faults come from: hardware, software, network.
Software bugs come from: application, libraries, OS & runtime system.
Multiple manifestations: hangs, crashes, silent data corruption, application is slower than usual.
Slide 2/24

Some Faults Manifest Only at Large Scale
Molecular dynamics simulation code (ddcMD).
Fault characteristics:
- Application hangs with 8,000 MPI tasks
- Only fails on Blue Gene/L
- Manifestation was intermittent
A large amount of time was spent fixing the problem; our technique isolated the origin of the problem in a few seconds.
Slide 3/24

Why Do We Need New Debugging Tools?
Current tools: old (breakpoint) technology (>30 years old), manual process to find bugs, poor scalability.
Future tools: automatic problem determination, less human intervention in determining the failure root cause, scalable (~millions of processes).
Slide 4/24

Approach Overview
We focus on pinpointing the origin of performance faults:
- Application hangs
- Execution is slower than usual
These could have multiple causes: deadlocks, slower code regions, communication problems, etc.
Steps: (1) model parallel tasks' behavior, (2) find faulty task(s), (3) find problematic code region.
Slide 5/24

Summarizing Execution History
HPC applications generate a large amount of trace data, so we use a probabilistic model to summarize the traces.
We model the control-flow behavior of MPI tasks, which allows us to find the least-progressed task.
Slide 6/24

Each MPI Task is Modeled as a Markov Model

Sample code:
  foo() {
    MPI_Gather(...)
    // Computation code
    for (...) {
      // Computation code
      MPI_Send(...)
      // Computation code
      MPI_Recv(...)
      // Computation code
    }
  }

Markov model (figure): states MPI_Gather, Comp. Code 1, MPI_Send, Comp. Code 2, MPI_Recv, Comp. Code 3, connected by edges with transition probabilities (e.g., 0.75, 0.6, 0.3).

MPI call wrappers:
- Gather the call stack
- Create states in the model
Slide 7/24

Approach Overview
Steps: (1) model parallel tasks' behavior, (2) find faulty task(s), (3) find problematic code region.
Slide 8/24
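
The wrapper mechanism from Slide 7 can be sketched as follows. This is a minimal, assumed illustration, not the tool's actual implementation: it identifies a state by the MPI call site instead of hashing the full call stack, and the table sizes and helper names are illustrative.

  /* Minimal sketch, assuming call-site-based states (the real tool uses
   * call-stack information); sizes and names are illustrative. */
  #include <mpi.h>

  #define MAX_STATES 256
  static void *state_addr[MAX_STATES];                 /* state -> call site */
  static unsigned edge_count[MAX_STATES][MAX_STATES];  /* transition counts  */
  static int num_states = 0, prev_state = -1;

  static int state_id(void *site)                      /* find or create a state */
  {
      for (int i = 0; i < num_states; i++)
          if (state_addr[i] == site) return i;
      if (num_states == MAX_STATES) return MAX_STATES - 1;   /* table full */
      state_addr[num_states] = site;
      return num_states++;
  }

  static void record_transition(void *site)
  {
      int s = state_id(site);
      if (prev_state >= 0) edge_count[prev_state][s]++; /* count the edge */
      prev_state = s;                                   /* now in state s */
  }

  /* PMPI wrapper: intercept MPI_Send, record the state, call the real send. */
  int MPI_Send(const void *buf, int count, MPI_Datatype type,
               int dest, int tag, MPI_Comm comm)
  {
      record_transition(__builtin_return_address(0));   /* GCC builtin: caller address */
      return PMPI_Send(buf, count, type, dest, tag, comm);
  }

Edge probabilities are obtained by normalizing each row of edge_count, and wrappers for the other MPI calls (MPI_Recv, MPI_Gather, ...) would be analogous.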

The Progress Dependence Graph
(Figure: tasks of an MPI program connected by "wait" edges between code regions, forming the progress dependence graph.)
The graph facilitates finding the origin of performance faults and allows the programmer to focus on the origin of the problem: the least-progressed task.
Slide 9/24

What Tasks Are Progress Dependent on Other Tasks?

Point-to-point operations:
  Task X:
    // computation code...
    MPI_Recv(..., taskY, ...)
    // ...
- Task X depends on task Y
- The dependency can be obtained from MPI call parameters and request handles

Collective operations:
  Task X:
    // computation code...
    MPI_Reduce(...)
    // ...
- Multiple implementations (e.g., binomial trees)
- A task can reach MPI_Reduce and continue
- Task X could block waiting for another (less progressed) task
Slide 10/24
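
As a concrete illustration of the point-to-point case above, an MPI wrapper can expose which peer a task is currently blocked on. This is a minimal sketch under assumptions; the variable name and the idea of exporting it this way are illustrative, not the paper's implementation.

  #include <mpi.h>

  /* -1 means "not currently blocked in a receive". */
  static volatile int waiting_on_rank = -1;

  int MPI_Recv(void *buf, int count, MPI_Datatype type, int source,
               int tag, MPI_Comm comm, MPI_Status *status)
  {
      waiting_on_rank = source;        /* may also be MPI_ANY_SOURCE */
      int rc = PMPI_Recv(buf, count, type, source, tag, comm, status);
      waiting_on_rank = -1;            /* call returned, no longer blocked */
      return rc;
  }

A diagnosis tool could read waiting_on_rank when the application hangs and turn "task X is blocked in MPI_Recv on task Y" into an edge of the progress dependence graph; non-blocking receives would be tracked through their request handles instead.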

Probabilistic Inference of the Progress-Dependence Graph
(Figure: a sample Markov model with states 1-10; tasks A, B, C, D, and E sit in different states, and some edges carry probabilities such as 0.3/0.7 and 0.9/0.1.)

Progress dependence between tasks B (state 3) and C (state 5)?
- Probability(3 -> 5) = 1, Probability(5 -> 3) = 0
- Task C is likely waiting for task B (a task in state 3 always reaches state 5)
- C has progressed further than B
Slide 11/24

Resolving Conflicting Probability Values

Dependence between tasks B (state 3) and D (state 9)?
- Probability(3 -> 9) = 0 and Probability(9 -> 3) = 0
- The dependency is null
Dependence between tasks C (state 5) and E (state 7)?
- Probability(7 -> 5) and Probability(5 -> 7) are both nonzero (the larger is 0.9)
- Heuristic: trust the highest probability
- C is likely waiting for E
Slide 12/24
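
The reachability probabilities used above can be computed from the Markov model's transition matrix. The following is a small sketch under assumptions (a dense matrix and a simple fixed-point iteration); the paper's exact computation may differ.

  /* Probability that a task currently in state 'from' eventually reaches
   * state 'to', given transition probabilities p[i][k].  Starting the
   * iteration from zero converges to the hitting probability; N and the
   * iteration count are illustrative. */
  #define N 16

  double reach_probability(double p[N][N], int from, int to)
  {
      double h[N] = {0.0}, next[N];
      h[to] = 1.0;                                /* already at the target */
      for (int iter = 0; iter < 1000; iter++) {
          for (int i = 0; i < N; i++) {
              if (i == to) { next[i] = 1.0; continue; }
              double s = 0.0;
              for (int k = 0; k < N; k++)
                  s += p[i][k] * h[k];            /* take one step, then reach 'to' */
              next[i] = s;
          }
          for (int i = 0; i < N; i++) h[i] = next[i];
      }
      return h[from];
  }

With task B in state 3 and task C in state 5, comparing reach_probability(p, 3, 5) against reach_probability(p, 5, 3) and trusting the larger value reproduces the decisions made on Slides 11 and 12.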

Distributed Algorithm to Infer the Graph
(Figure: a timeline of tasks 1..n exchanging their current states and then reducing their local graphs.)
1. All-reduction of current states: all tasks know the state of the others.
2. Each task builds its progress-dependence graph locally.
3. Reduction of the progress-dependence graphs into one graph.
Reductions are O(log #tasks).
Slide 13/24

Reduction Operations: Unions of Graph Dependencies
Notation: X -> Y means X is progress dependent on Y.
Examples of reduction operations:
  Task A input   Task B input   Result
  X -> Y         X -> Y         X -> Y (same dependence)
  X -> Y         Null           X -> Y (first dominates)
  X -> Y         Y -> X         Undefined (or Null)
  Null           Null           Null
Slide 14/24
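
A hedged sketch of the two global steps on Slides 13 and 14, assuming each pairwise dependence is encoded as a small code (0 = null, 1 = "i depends on j", 2 = "j depends on i", 3 = undefined); the buffer layout and task count are illustrative, not the tool's actual data structures.

  #include <mpi.h>

  #define NP 64                          /* example number of tasks */
  static unsigned char edges[NP][NP];    /* locally built dependence codes */

  /* Union rule from the table on Slide 14. */
  static unsigned char merge(unsigned char a, unsigned char b)
  {
      if (a == b) return a;              /* same dependence           */
      if (a == 0) return b;              /* one side null: other wins */
      if (b == 0) return a;
      return 3;                          /* X->Y vs Y->X: undefined   */
  }

  static void merge_op(void *in, void *inout, int *len, MPI_Datatype *dt)
  {
      (void)dt;                          /* datatype unused in this sketch */
      unsigned char *a = in, *b = inout;
      for (int i = 0; i < *len; i++) b[i] = merge(a[i], b[i]);
  }

  void build_progress_dependence_graph(MPI_Comm comm, int my_state)
  {
      int states[NP];
      /* Step 1: every task learns the current state of every other task. */
      MPI_Allgather(&my_state, 1, MPI_INT, states, 1, MPI_INT, comm);

      /* Step 2 (omitted): fill edges[][] locally from the reachability
       * probabilities between my_state and states[j]. */

      /* Step 3: reduction that unions the local graphs at the root. */
      static unsigned char global[NP][NP];
      MPI_Op op;
      MPI_Op_create(merge_op, 1 /* commutative */, &op);
      MPI_Reduce(edges, global, NP * NP, MPI_UNSIGNED_CHAR, op, 0, comm);
      MPI_Op_free(&op);
  }

Because the user-defined operation is commutative and associative, MPI can apply it along a reduction tree, matching the O(log #tasks) cost stated on Slide 13.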

Bug Progress Dependence Graph
Hang with ~8,000 MPI tasks on BlueGene/L.
(Figure: the progress dependence graph, with task [3136] marked as the least-progressed task and the remaining tasks grouped as [0, 2048, 3072], [6840], [1-2047, 3073-3135, ...], and [6841-7995].)
Our tool finds that task 3136 is the origin of the hang. How did it reach its current state?
Slide 15/24

Approach Overview
Steps: (1) model parallel tasks' behavior, (2) find faulty task(s), (3) find problematic code region.
Slide 16/24

Finding the Faulty Code Region: Program Slicing
(Figure: the progress dependence graph over tasks 1-4 and the states of the least-progressed tasks are fed to a slicing tool, which extracts the code that affects reaching those states.)
Example code from the figure:
  done = 1;
  for (...) {
    if (event) {
      flag = 1;
    }
    if (flag == 1) {
      MPI_...;
      ...
    }
  }
  if (done == 1) {
    MPI_Barrier();
  }
Slide 17/24

Code to Handle Buffered I/O in ddcMD
(Figure: a writer task exchanges "signal" and "data" messages with non-writer tasks.)
Writer pseudocode:
  datawritten = 0
  for (...) {
    Probe(..., &flag, ...)      // check if another writer asks for data
    if (flag == 1) {
      datawritten = 1
      // Write data
    }
  }
  if (datawritten == 0) {
    Reduce(...)
    Barrier()
  }
Slide 18/24

Slice With the Origin of the Bug
(Figure: the same writer code as on Slide 18, with the state of the least-progressed task highlighted in the slice.)
- A dual condition occurs on BlueGene/L: a task is both a writer and a non-writer.
- MPI_Probe matches a message by source, tag, and communicator.
- Another writer intercepted the wrong message.
- Fix: the programmer used unique MPI tags to isolate the different I/O groups.
Slide 19/24

Fault Injection Methodology
Faults were injected in two Sequoia benchmarks: AMG-2006 and LAMMPS.
We injected a hang in random MPI tasks, at 20 user function calls and 5 MPI calls:
- Only injected in functions that are actually executed
- Functions are selected randomly
Slide 20/24
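
To make the injection mechanism concrete, here is a minimal sketch; it is an assumption about how a hang can be injected via a PMPI wrapper, not the authors' injector, and FAULTY_RANK is a hypothetical environment variable selecting the victim task.

  #include <mpi.h>
  #include <stdlib.h>
  #include <unistd.h>

  /* Hang forever if this task was selected as the injection victim. */
  static void maybe_hang(void)
  {
      int rank;
      PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
      const char *victim = getenv("FAULTY_RANK");   /* hypothetical knob */
      if (victim && rank == atoi(victim))
          for (;;) sleep(1);                        /* injected hang */
  }

  /* Example injection site: one of the intercepted MPI calls. */
  int MPI_Barrier(MPI_Comm comm)
  {
      maybe_hang();
      return PMPI_Barrier(comm);
  }

Injection at user-level function calls would work the same way, with maybe_hang() placed at the randomly chosen call sites.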

Accurate Detection of Least-Progressed Tasks
Metrics:
- Least-progressed (LP) task detection recall: fraction of cases in which the LP task is detected correctly
- Imprecision: percentage of extra tasks in the reported LP task set
Example runs (64 tasks, fault injected in task 3):
- Example 1: reported LP task set [3]; the LP task is detected, imprecision = 0
- Example 2: reported LP task set [3, 5, 4]; the LP task is detected, imprecision = 2/3
(Figures: the progress dependence graphs for the two example runs, showing the remaining task groups.)
Overall results:
- Average LP task detection recall is 88%
- 86% of injections have an imprecision of zero
Slide 21/24

Performance Results: Least-Progressed Task Detection Takes a Fraction of a Second
(Figure: detection-time measurements.)
Slide 22/24

Performance Results: Slowdown Is Small for a Variety of Benchmarks
We tested the slowdown with the NAS Parallel and Sequoia benchmarks.
Maximum slowdown of ~1.67x; the slowdown depends on the number of MPI calls made from different contexts.
Slide 23/24

Conclusions
Our debugging approach diagnoses faults in HPC applications.
Novelties:
- Compression of historic control-flow behavior
- Probabilistic inference of the least-progressed tasks
- Guided application of program slicing
The distributed debugging method is scalable: it takes a fraction of a second with 32,000 BlueGene/P tasks.
Successful evaluation with a hard-to-detect bug and representative benchmarks.
Slide 24/24

Thank you!
Slide 25/24

Backup Slides
Slide 26/24

Dual Role Due to the BlueGene/L I/O Structure
(Figure: tasks 1-2 on node X and tasks 5-6 on node Y, split across I/O groups A and B.)
Dual role of task 6:
- Non-writer for its own group (B)
- Writer for a different group (A)
In BlueGene/L, I/O is performed through dedicated I/O nodes, and the program nominates only one task per I/O node.
Slide 27/24