Large Scale Debugging of Parallel Tasks with AutomaDeD


International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Seattle, Nov 2011

Large Scale Debugging of Parallel Tasks with AutomaDeD
Ignacio Laguna, Saurabh Bagchi, Todd Gamblin, Bronis R. de Supinski, Greg Bronevetsky, Dong H. Ahn, Martin Schulz, Barry Rountree

Debugging Large-Scale Parallel Applications is Challenging
- Millions of cores will soon be present in the largest systems.
- Developing correct HPC applications is increasingly difficult, and traditional debuggers scale poorly.
- Faults come from various sources:
  - Hardware: physical degradation, soft/hard errors
  - Software: performance faults, coding bugs, misconfigurations

Fault Detection/Diagnosis Is Done Offline
- Existing approaches (debuggers and statistical tools): after a fault manifests, stream traces to a central component, analyze the traces offline in that component, and then pinpoint where the problem is.
- Our approach: analyze traces online, across all processes, and pinpoint where the problem is as the application runs.
- How? Distributed, scalable methods and efficient data structures.

PREVIOUS WORK: AutomaDeD, Fault Detection and Diagnosis in MPI Applications (DSN '10)
- Traces are collected from every task of the MPI application via MPI wrappers, before and after each call:

    MPI_Send() {
        tracesbeforecall();
        PMPI_Send();
        tracesaftercall();
    }

- The profiler collects call-stack information and time measurements.
- Each task's behavior is summarized as a Semi-Markov Model (SMM); the models are analyzed offline to find the anomalous phase, task, and code region.
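The sketch below shows, in ordinary C, how such a PMPI wrapper could be written. It is only an illustration of the mechanism: the hook names record_event_before/record_event_after and the use of MPI_Wtime for timestamps are assumptions, not the actual AutomaDeD implementation (which also records call-stack information).

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical trace hooks; the real profiler also captures the call stack. */
    static void record_event_before(const char *call) {
        printf("before %s at %f\n", call, MPI_Wtime());
    }
    static void record_event_after(const char *call) {
        printf("after  %s at %f\n", call, MPI_Wtime());
    }

    /* PMPI wrapper: intercept MPI_Send, record events, forward to the real call. */
    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm) {
        record_event_before("MPI_Send");
        int rc = PMPI_Send(buf, count, type, dest, tag, comm);
        record_event_after("MPI_Send");
        return rc;
    }

Linking such a wrapper ahead of the MPI library makes every MPI_Send in the application pass through it, which is how profilers built on the PMPI interface interpose on MPI calls.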

PREVIOUS WORK: Modeling Timing and Control-Flow Structure
- Each task's Semi-Markov Model captures the task's control flow and timing; the models are later analyzed to find the anomalous phase, task, and code region.
- States: (a) the code of an MPI call, or (b) the computation code between MPI calls.
- Edges: annotated with a transition probability and a time distribution (a sketch of such an edge follows this section).
- [Figure: example edge Send() -> After_Send() with transition probability 0.6 and an associated time distribution]

PREVIOUS WORK: Detection of the Anomalous Phase / Task / Code Region
- Compute the dissimilarity between task models, Diss(SMM_i, SMM_j) >= 0.
- Cluster the models, then find the unusual cluster(s) using a known normal clustering setting.
- [Figure: clustering of task models for a master-slave program; the unusual cluster contains the faulty tasks]
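To make the edge annotations concrete, an SMM edge could be represented by a structure like the one below. The field names and the choice of a Gaussian (mean, standard deviation) time model are assumptions for illustration, not AutomaDeD's actual types; later sketches in this transcription reuse this struct.

    /* One edge of a Semi-Markov Model: a control-flow transition between two states. */
    typedef struct {
        int    src_state;   /* state for the source code region */
        int    dst_state;   /* state for the destination code region */
        double trans_prob;  /* probability of taking this transition */
        double mean_time;   /* mean of the (assumed Gaussian) time before transitioning */
        double std_time;    /* standard deviation of that time */
    } smm_edge;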

CURRENT WORK: Contributions and Remaining Agenda
- Online fault detection using AutomaDeD
- Efficient model comparison
- Scalable faulty-task detection: CAPEK clustering and nearest neighbor (NN)
- Accurate faulty-task isolation: model graph compression
- Evaluation at scale (thousands of processes)

How to Compare Two Semi-Markov Models?
- Compare the two graphs edge by edge and add up the per-edge dissimilarities.
- For each edge, compare the transition probabilities (TP) and the time distributions (TD = N(µ, σ)) using an Lk-norm based on the area of overlap between the distributions.
- Diss(SMM_1, SMM_2) = Σ Lk-norm(TP) + Σ Lk-norm(TD), summed over the edges.
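A minimal sketch of this edge-by-edge comparison, assuming the smm_edge structure from the earlier sketch and a helper lk_norm_gauss() for the time-distribution term (its lookup-table approximation is sketched in the next section). This is an illustration of the formula, not the authors' code; in particular, using a plain absolute difference for the TP term is a simplification.

    #include <math.h>

    double lk_norm_gauss(double mu1, double s1, double mu2, double s2); /* see next sketch */

    /* Edge-by-edge dissimilarity of two SMMs that share the same set of n edges. */
    double smm_dissimilarity(const smm_edge *a, const smm_edge *b, int n) {
        double diss = 0.0;
        for (int i = 0; i < n; i++) {
            diss += fabs(a[i].trans_prob - b[i].trans_prob);       /* TP term */
            diss += lk_norm_gauss(a[i].mean_time, a[i].std_time,
                                  b[i].mean_time, b[i].std_time);  /* TD term */
        }
        return diss;
    }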

Efficient Comparison of Model Graphs
- Evaluating the Lk-norm directly is computationally expensive: an integral has to be evaluated for each edge.
- Solution: approximate it with a pre-computed lookup table, which reduces the per-edge cost to constant time (a sketch follows this section).

Scalable Faulty-Task Isolation
- Typical use case: AutomaDeD isolates the abnormal task(s) after a failure.
- The input in large-scale applications is often thousands of tasks.
- Comparing every task against every other task does not scale: complexity O(#tasks^2).
- Challenge: find a scalable algorithm for outlier detection.
- [Figure: tasks in an erroneous run, with one abnormal task standing apart from the rest]
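One possible realization of the lookup-table idea, under deliberately simplifying assumptions (an L1 distance between the two Gaussians and a shared standard deviation): the expensive integral is evaluated once per table entry at initialization, and every later edge comparison is a constant-time table lookup. This is not the actual AutomaDeD table, only an illustration of the technique.

    #include <math.h>

    #define TABLE_SIZE 1024
    #define MAX_SEP    8.0   /* table covers mean separations of 0 to 8 standard deviations */

    static double lk_table[TABLE_SIZE];

    static double gauss(double x, double mu) {        /* pdf of N(mu, 1) */
        return exp(-0.5 * (x - mu) * (x - mu)) / 2.5066282746310002;
    }

    /* Fill the table once: L1 distance between N(0,1) and N(d,1) by numeric integration. */
    void lk_table_init(void) {
        for (int i = 0; i < TABLE_SIZE; i++) {
            double d = (MAX_SEP * i) / (TABLE_SIZE - 1);
            double sum = 0.0, dx = 0.01;
            for (double x = -10.0; x <= d + 10.0; x += dx)
                sum += fabs(gauss(x, 0.0) - gauss(x, d)) * dx;
            lk_table[i] = sum;
        }
    }

    /* Constant-time approximation of the distance between two edge time distributions.
       Simplification for this sketch: both distributions use one shared std (their average). */
    double lk_norm_gauss(double mu1, double s1, double mu2, double s2) {
        double s = 0.5 * (s1 + s2);
        if (s <= 0.0) return (fabs(mu1 - mu2) > 0.0) ? 2.0 : 0.0;
        double d = fabs(mu1 - mu2) / s;
        if (d >= MAX_SEP) return lk_table[TABLE_SIZE - 1];
        return lk_table[(int)(d / MAX_SEP * (TABLE_SIZE - 1))];
    }

Calling lk_table_init() once at startup means lk_norm_gauss() never integrates at comparison time, which is what turns the per-edge cost into a constant.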

Faulty-Task Isolation Using Clustering (CAPEK)
- Intuition: an abnormal task lies far away from its cluster medoid.
- CAPEK is a scalable clustering algorithm (ICS '10) designed for large-scale distributed data sets. It is based on sampling a constant number of tasks and has complexity O(log #tasks).

Outlier-Detection Algorithm Using Clustering (a sketch follows this section)
(1) Perform clustering using CAPEK.
(2) Find each task's distance from its cluster medoid.
(3) Normalize the distances.
(4) Find the top-k outliers by sorting the tasks by largest distance.
- The algorithm is fully distributed: it does not require a central component to perform the analysis. Complexity: O(log #tasks).
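A simplified sketch of steps (2) through (4), assuming the CAPEK step has already given each task its distance to its cluster medoid (my_medoid_dist). The Allgather-plus-sort selection used here is a naive stand-in for the fully distributed top-k selection the authors describe; it only illustrates the normalize-and-rank logic.

    #include <mpi.h>
    #include <stdlib.h>

    static int cmp_desc(const void *a, const void *b) {
        double d = *(const double *)b - *(const double *)a;
        return (d > 0) - (d < 0);
    }

    /* Returns 1 if this rank is among the k most distant (i.e., most suspicious) tasks. */
    int is_top_k_outlier(double my_medoid_dist, int k, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* Normalize distances by the global maximum. */
        double max_dist;
        MPI_Allreduce(&my_medoid_dist, &max_dist, 1, MPI_DOUBLE, MPI_MAX, comm);
        double norm = (max_dist > 0.0) ? my_medoid_dist / max_dist : 0.0;

        /* Naive selection: gather all normalized distances and sort them locally
           (the real algorithm avoids this step to stay at O(log #tasks)). */
        double *all = malloc(size * sizeof(double));
        MPI_Allgather(&norm, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, comm);
        qsort(all, size, sizeof(double), cmp_desc);
        double threshold = all[(k < size ? k : size) - 1];
        free(all);
        return norm >= threshold;
    }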

Faulty-Task Isolation Using Nearest Neighbor (NN)
(1) Sample a constant number of tasks (a sketch follows this section).
(2) Broadcast the samples to all tasks.
(3) Each task finds its nearest-neighbor (NN) distance to the samples.
(4) Sort the tasks by distance and select the top-k.
- Assumption: a faulty task will be far from its nearest neighbor.
- Faster than clustering, but works well only when there is one (or only a few) faulty task(s). Complexity: O(log #tasks).

Too Many Graph Edges: The Curse of Dimensionality
- A sample SMM graph has hundreds of edges.
- Too many edges means too many dimensions, which degrades the accuracy of both clustering and nearest neighbor.
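A sketch of steps (1) through (3), assuming each task has summarized its model as a fixed-length feature vector (model_vec of length dim) and that Euclidean distance is used; the random choice of sample ranks at rank 0 is likewise an assumption made for illustration.

    #include <mpi.h>
    #include <math.h>
    #include <stdlib.h>

    /* Returns this task's distance to its nearest neighbor among n_samples sampled tasks. */
    double nn_distance(const double *model_vec, int dim, int n_samples, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* Rank 0 picks a constant number of random sample ranks and broadcasts the choice. */
        int *sample_ranks = malloc(n_samples * sizeof(int));
        if (rank == 0)
            for (int i = 0; i < n_samples; i++) sample_ranks[i] = rand() % size;
        MPI_Bcast(sample_ranks, n_samples, MPI_INT, 0, comm);

        /* Broadcast each sampled task's feature vector to everyone. */
        double *samples = malloc((size_t)n_samples * dim * sizeof(double));
        for (int i = 0; i < n_samples; i++) {
            if (rank == sample_ranks[i])
                for (int j = 0; j < dim; j++) samples[i * dim + j] = model_vec[j];
            MPI_Bcast(&samples[i * dim], dim, MPI_DOUBLE, sample_ranks[i], comm);
        }

        /* Nearest-neighbor distance to the samples (ignoring a sample that is this task). */
        double best = INFINITY;
        for (int i = 0; i < n_samples; i++) {
            if (sample_ranks[i] == rank) continue;
            double d2 = 0.0;
            for (int j = 0; j < dim; j++) {
                double diff = model_vec[j] - samples[i * dim + j];
                d2 += diff * diff;
            }
            if (d2 < best) best = d2;
        }
        free(sample_ranks);
        free(samples);
        return sqrt(best);
    }

Each task then reports this value, and the tasks with the largest NN distances are selected as the suspects, matching step (4) above.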

Graph Compression to Reduce Dimensionality
- Sequences of edges can be merged, typically at the beginning and end of the program's main loop.
- The merged edge keeps the common transition probability (1.0 for a linear chain) and adds up the parameters of the Normal distributions: mean = Σ mean, std = Σ std.
- [Figure: three chained edges, each with transition probability 1.0 and time distribution N(µ_i, σ_i), compressed into one edge with probability 1.0 and a summed distribution]

Example of Graph Compression (a sketch of the merge step follows this section)
Sample code:

    MPI_Init()
    // comp code
    MPI_Gather()
    // comp code
    for (...) {
        // comp code
        MPI_Send()
        // comp code
        MPI_Recv()
        // comp code
    }
    // comp code
    MPI_Bcast()
    // comp code
    MPI_Finalize()

- The corresponding Semi-Markov Model is a chain Init -> Comp -> Gather -> Comp -> Send -> Comp -> Recv -> Comp -> Bcast -> Comp -> Finalize, with the Send/Recv portion forming the loop.
- In the compressed Semi-Markov Model, the linear chains before and after the loop are each merged into a single edge, while the loop body is preserved.
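A minimal sketch of that merge step, assuming the smm_edge structure from the earlier sketch and a model stored as a linear chain of edges. It collapses maximal runs of always-taken (probability 1.0) edges by summing their Normal parameters, exactly as described above; it is an illustration, not the authors' implementation.

    /* Compress a linear chain of SMM edges in place: consecutive edges that are always
       taken (transition probability 1.0) merge into one edge whose time distribution has
       mean = sum of means and std = sum of stds. Returns the new number of edges. */
    int compress_chain(smm_edge *edges, int n) {
        int out = 0;
        for (int i = 0; i < n; i++) {
            if (out > 0 &&
                edges[out - 1].trans_prob == 1.0 &&
                edges[i].trans_prob == 1.0 &&
                edges[out - 1].dst_state == edges[i].src_state) {
                /* Extend the previously merged edge. */
                edges[out - 1].dst_state  = edges[i].dst_state;
                edges[out - 1].mean_time += edges[i].mean_time;
                edges[out - 1].std_time  += edges[i].std_time;
            } else {
                edges[out++] = edges[i];
            }
        }
        return out;
    }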

Fault Injection Types (a sketch of one injection follows this section)
- We inject faults into the NAS Parallel Benchmarks: BT, SP, CG, FT, LU.
- Injections occur at a random {MPI call, task}.
- Platform: the Linux Sierra cluster at LLNL (six-core nodes, 2.8 GHz).
- Total of 960 experiments.

    Type            Description
    CPU_INTENSIVE   CPU-intensive code region (a triply nested loop)
    MEM_INTENSIVE   Memory-intensive code (filling a GB-sized buffer at random locations)
    HANG            Local deadlock: the process suspends execution indefinitely
    TRANS_STALL     Transient stall: the process suspends execution for a few seconds

Evaluation Metric: Task-Isolation Recall
- Recall: the fraction of runs in which the faulty task (the one where the fault was injected) appears among the top reported abnormal tasks.
- Example: with three runs, if the injected task is reported in two of them, the task-isolation recall is 2 / 3 = 0.67.
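To make the fault types concrete, here is a hypothetical CPU_INTENSIVE injection as it might be wired into an MPI wrapper. The trigger condition, the loop bound, and the volatile accumulator are assumptions for illustration; the slides do not describe the injector at this level of detail.

    /* Hypothetical CPU_INTENSIVE fault: burn cycles in a triply nested loop. */
    static void inject_cpu_intensive(int n) {
        volatile double sink = 0.0;   /* volatile keeps the loop from being optimized away */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    sink += (double)i * j + k;
    }

    /* Called from the MPI wrappers: fire the fault at one randomly chosen {MPI call, task}. */
    static void maybe_inject(int my_rank, int target_rank, long call_count, long target_call) {
        if (my_rank == target_rank && call_count == target_call)
            inject_cpu_intensive(500);
    }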

Graph Compression Results
- [Figure: NN and clustering task-isolation recall for BT, SP, CG, FT, LU, and MG under CPU_INTENSIVE, MEM_INTENSIVE, HANG, and TRANS_STALL injections, with compression versus the baseline without compression]
- Compression improves task-isolation recall: dimensionality reduction effectively helps both clustering and NN.
- For example, for BT the recall improves to 100% with NN once compression is applied.
- NN appears better, but the injections occur in only one task.

Performance Results: Overhead Reduction Using the Pre-Computed Lk-Norm
- [Figure: time to compare two SMMs (log scale) versus SMM size in edges, baseline versus pre-computed Lk-norm]
- Using the pre-computed table to estimate the Lk-norm reduces the overhead of comparing SMM graphs; the pre-computed method is about 6x faster than the baseline.

Performance Experiments at Scale
- We use the Algebraic Multigrid Benchmark (AMG 2006), a scalable multigrid solver that has been demonstrated at very large task counts on BlueGene/L.
- We ran it with thousands of tasks on the LLNL Sierra Linux cluster and measured the time of edge/task isolation and of graph compression.

Performance Results: Time to Perform the Analysis at Large Scale
- [Figure: analysis time versus number of tasks for the NN and clustering methods, with compression and for the baseline without compression, plus a breakdown into task isolation, graph compression, and edge isolation]
- Compression does not incur much overhead, because most of its computation is performed locally.
- The entire analysis is performed within seconds at the largest task counts tested.
- The analysis time scales logarithmically with respect to the number of tasks.

Concluding Remarks
- Contributions: a scalable technique to detect faults in MPI applications; an implementation that scales easily to thousands of tasks; compressing the task graph improves anomaly-detection accuracy.
- Future work: extend the compression technique to allow finer-grained instrumentation of function calls; capture more information to detect a wider range of faults.

Bring us your fault or bug at large scale (a performance anomaly, a coding bug) and we will be happy to try AutomaDeD on it.
Ignacio Laguna <ilaguna@purdue.edu>

Backup Slides

Edge Isolation Results
- [Figure: NN and clustering edge-isolation recall for BT, SP, CG, FT, LU, and MG under CPU_INTENSIVE, MEM_INTENSIVE, HANG, and TRANS_STALL injections]
- Edge-isolation recall is high for both clustering and NN.
- This assumes that the faulty task has already been correctly identified.

Trend Lines for Analysis Time
- [Figure: analysis time versus number of tasks with fitted curves of the form f(x) = a ln(x) + b, with slopes of roughly 0.79 and 0.8 for the clustering and NN methods]
- Logarithmic curves model the data well: the analysis takes 8.67 seconds at the tens-of-thousands-of-tasks scale, and the fit predicts that it remains in the tens of seconds at an order of magnitude more tasks.
