Large Scale Debugging of Parallel Tasks with AutomaDeD
International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Seattle, Nov 2011

Large Scale Debugging of Parallel Tasks with AutomaDeD
Ignacio Laguna, Saurabh Bagchi, Todd Gamblin, Bronis R. de Supinski, Greg Bronevetsky, Dong H. Ahn, Martin Schulz, Barry Rountree

Debugging Large-Scale Parallel Applications is Challenging
- Millions of cores will soon appear in the largest systems, increasing the difficulty of developing correct HPC applications.
- Traditional debuggers scale poorly.
- Faults come from various sources:
  - Hardware: physical degradation, soft/hard errors
  - Software: performance faults, coding bugs, misconfigurations
Fault Detection/Diagnosis Is Done Offline
- Existing approaches (debuggers & statistical tools) stream traces to a central component and analyze them offline there to pinpoint where the problem is, long after the fault manifests.
- Our approach analyzes traces online, from all processes, to pinpoint where the problem is while the application runs. How? Distributed, scalable methods and efficient data structures.

PREVIOUS WORK: AutomaDeD
Fault detection & diagnosis in MPI applications (DSN 2010).
- An MPI application runs tasks 1..N; an MPI profiler collects traces via wrappers, before and after each call:

      MPI_Send() {
          traces_before_call();
          PMPI_Send();
          traces_after_call();
      }

- The wrappers collect call-stack information and time measurements.
- Offline, per-task Semi-Markov Models (SMMs) are used to find the anomalous phase, task, and code region.
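As a concrete illustration of this interposition pattern, here is a minimal PMPI wrapper in C; the trace-recording helper and its buffer are hypothetical stand-ins for AutomaDeD's internals, not the authors' actual code.

    #include <mpi.h>
    #include <time.h>

    /* Hypothetical helper: append a (label, timestamp) event to a
       per-task trace buffer (buffer handling omitted for brevity). */
    static void record_event(const char *label) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        (void)label; (void)ts;  /* ... store the event ... */
    }

    /* Overriding MPI_Send: application calls resolve to this wrapper,
       and the real communication is done by PMPI_Send. */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm) {
        record_event("before MPI_Send");
        int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
        record_event("after MPI_Send");
        return rc;
    }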
PREVIOUS WORK: Modeling Timing and Control-Flow Structure
Each task's trace is summarized as a Semi-Markov Model (SMM):
- States: (a) the code of an MPI call, or (b) the code between MPI calls (e.g., a Send() state followed by an After_Send() state).
- Edges: annotated with a transition probability (e.g., 0.6) and a time distribution. (A sketch of this edge layout follows the next slide.)

PREVIOUS WORK: Detection of Anomalous Phase / Task / Code-Region
- Compute a dissimilarity between task models, Diss(SMM_i, SMM_j) >= 0.
- Cluster the models and find the unusual cluster(s), using a known normal clustering configuration.
- (Figure: in a master-slave program, the task models form one large cluster plus a small, unusual cluster; the tasks in the unusual cluster are the faulty ones.)
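One plausible in-memory layout for the annotated edges just described, assumed by the sketches that follow; this is illustrative, not AutomaDeD's actual data structure.

    /* An SMM edge: transition probability plus the parameters of a
       normal fit of the observed transition times. */
    typedef struct {
        int    src_state;   /* MPI-call or between-calls code region */
        int    dst_state;
        double trans_prob;  /* transition probability along this edge */
        double mu, sigma;   /* time distribution modeled as N(mu, sigma) */
    } smm_edge;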
CURRENT WORK: Contributions and Remaining Agenda
- Online fault detection using AutomaDeD
- Efficient model comparison
- Scalable faulty-task detection: CAPEK clustering, nearest-neighbor (NN)
- Accurate faulty-task isolation: model graph compression
- Evaluation at scale (thousands of processes)

How to Compare Two Semi-Markov Models?
Compare the two graphs edge by edge and add up the per-edge dissimilarities:

    Diss(SMM_1, SMM_2) = Σ Lk-norm(TP) + Σ Lk-norm(TD)

where TP is an edge's transition probability and TD is its time distribution; for the time distributions (e.g., N(µ_1, σ_1) vs. N(µ_2, σ_2)), the Lk-norm measures how little the two densities overlap.
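A minimal sketch of this edge-by-edge comparison, assuming both models share an edge set, using the smm_edge layout above, and using an absolute difference for the TP term; the exact per-term norms are not spelled out on the slide.

    #include <math.h>

    /* Lk-norm between two normal time distributions; realized via a
       pre-computed table in a later sketch. */
    double lknorm_normal(double mu1, double s1, double mu2, double s2);

    double smm_dissimilarity(const smm_edge *a, const smm_edge *b,
                             int n_edges) {
        double diss = 0.0;
        for (int i = 0; i < n_edges; i++) {
            diss += fabs(a[i].trans_prob - b[i].trans_prob);   /* TP term */
            diss += lknorm_normal(a[i].mu, a[i].sigma,         /* TD term */
                                  b[i].mu, b[i].sigma);
        }
        return diss;
    }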
Efficient Comparison of Model Graphs
- Computing the Lk-norm directly is computationally expensive: an integral has to be evaluated for each edge.
- Solution: approximate it using a pre-computed lookup table, making the per-edge cost constant time. (A sketch follows the next slide.)

Scalable Faulty-Task Isolation
- Typical use case: AutomaDeD isolates the abnormal task(s) after a failure.
- The input in large-scale applications is often thousands of tasks.
- Comparing each task against every other doesn't scale: complexity O(#tasks^2).
- (Figure: tasks in an erroneous run, with one abnormal task standing apart.)
- Challenge: find a scalable algorithm for outlier detection.
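One way to realize the constant-time approximation mentioned above: tabulate the L1 distance between two unit normals as a function of their standardized mean gap, then index the table at comparison time. The table shape, the standardization (which collapses both sigmas into one scale), and k = 1 are assumptions for illustration, not the authors' exact scheme.

    #include <math.h>

    #define TABLE_SIZE 1024
    #define MAX_GAP    8.0   /* gaps beyond this saturate the table */

    static double lk_table[TABLE_SIZE];

    static double npdf(double x, double mu) {      /* N(mu, 1) density */
        return exp(-0.5 * (x - mu) * (x - mu)) / sqrt(2.0 * M_PI);
    }

    /* Fill the table once: numerically integrate |N(0,1) - N(gap,1)|. */
    void lk_table_init(void) {
        for (int i = 0; i < TABLE_SIZE; i++) {
            double gap = MAX_GAP * i / (TABLE_SIZE - 1), sum = 0.0;
            for (double x = -8.0; x <= gap + 8.0; x += 0.01)
                sum += fabs(npdf(x, 0.0) - npdf(x, gap)) * 0.01;
            lk_table[i] = sum;
        }
    }

    /* Constant-time lookup replacing the per-edge integral. */
    double lknorm_normal(double mu1, double s1, double mu2, double s2) {
        double gap = fabs(mu1 - mu2) / (0.5 * (s1 + s2) + 1e-12);
        int idx = (int)(gap / MAX_GAP * (TABLE_SIZE - 1));
        if (idx >= TABLE_SIZE) idx = TABLE_SIZE - 1;
        return lk_table[idx];
    }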
Faulty-Task Isolation Using Clustering (CAPEK)
- Idea: an abnormal task lies far away from its cluster medoid (figure: tasks in an erroneous run, clustered around medoids, with the abnormal task far from its medoid).
- CAPEK is a scalable clustering algorithm (ICS 2010) designed for large-scale distributed data sets. It is based on sampling a constant number of tasks and has complexity O(log #tasks).

Algorithm Using Clustering
(1) Perform clustering using CAPEK.
(2) Find each task's distance from its cluster medoid.
(3) Normalize the distances.
(4) Find the top-k outliers by sorting tasks on the largest distances.
The algorithm is fully distributed; it doesn't require a central component to perform the analysis. Complexity: O(log #tasks). (A sketch of steps 2-4 follows.)
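A minimal sketch of steps (2)-(4) from one task's point of view, assuming a CAPEK-style clustering has already assigned every task a medoid; normalizing by the global maximum is one plausible choice, and the MPI reduction keeps the analysis distributed.

    #include <mpi.h>

    /* my_dist: this task's model dissimilarity to its cluster medoid. */
    double outlier_score(double my_dist, MPI_Comm comm) {
        double max_dist;
        /* (3) normalize by the largest medoid distance across all tasks */
        MPI_Allreduce(&my_dist, &max_dist, 1, MPI_DOUBLE, MPI_MAX, comm);
        return (max_dist > 0.0) ? my_dist / max_dist : 0.0;
        /* (4) a distributed top-k selection over these scores then yields
           the candidate faulty tasks; no central component is needed. */
    }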
Faulty-Task Isolation Using Nearest-Neighbor (NN)
(1) Sample a constant number of tasks.
(2) Broadcast the samples to all tasks.
(3) Each task computes its distance to its nearest neighbor among the samples.
(4) Sort tasks based on these distances and select the top-k ones.
- Assumption: a faulty task will be far from its nearest neighbor.
- Faster than clustering, but works well only when there is one (or a few) faulty task(s). Complexity: O(log #tasks). (A sketch follows the next slide.)

Too Many Graph Edges: The Curse of Dimensionality
A sample SMM graph has over a hundred edges. Too many edges means too many dimensions, and high dimensionality degrades the performance of both clustering and nearest-neighbor.
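A minimal sketch of the NN scoring from the slide above, assuming each task's model has been flattened into a fixed-length feature vector (one value per edge) and that the sampled vectors have already been distributed to every task, e.g. with MPI_Bcast; the names are illustrative.

    #include <float.h>
    #include <math.h>

    /* Distance from this task's vector to the nearest sampled vector;
       a large value marks the task as a likely outlier. */
    double nn_score(const double *mine, const double *samples,
                    int n_samples, int n_dims) {
        double best = DBL_MAX;
        for (int s = 0; s < n_samples; s++) {
            double d = 0.0;
            for (int j = 0; j < n_dims; j++) {
                double diff = mine[j] - samples[s * n_dims + j];
                d += diff * diff;
            }
            if (d < best) best = d;
        }
        return sqrt(best);
    }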
Graph Compression to Reduce Dimensionality
Sequences of edges can be merged, typically at the beginning and end of the program's main loop. A chain of edges that each have transition probability 1.0 collapses into a single edge that keeps the same transition probability (1.0) and adds up the parameters of the Normal time distributions:

    mean = Σ mean_i
    std  = Σ std_i

Example of Graph Compression
Sample code:

    MPI_Init()
    // comp code
    MPI_Gather()
    // comp code
    for (...) {
        // comp code
        MPI_Send()
        // comp code
        MPI_Recv()
        // comp code
    }
    // comp code
    MPI_Bcast()
    // comp code
    MPI_Finalize()

In the corresponding Semi-Markov Model, each MPI call and each computation region between calls is a state (Init, Comp, Gather, Comp, Send, Comp, Recv, Comp, Bcast, Comp, Finalize). Compression merges the straight-line chains before and after the main loop, leaving a much smaller model that preserves the loop's Send/Recv structure. (A sketch of the merge rule follows.)
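A minimal sketch of this merge rule using the smm_edge layout from earlier; the chain check mirrors the slide's condition (all transition probabilities 1.0), and the parameter sums follow the formulas above. It assumes n >= 1.

    /* Collapse a linear chain of edges into one edge. Returns -1 if the
       chain is not deterministic (some transition probability != 1.0). */
    int compress_chain(const smm_edge *chain, int n, smm_edge *out) {
        for (int i = 0; i < n; i++)
            if (chain[i].trans_prob != 1.0) return -1;
        out->src_state  = chain[0].src_state;
        out->dst_state  = chain[n - 1].dst_state;
        out->trans_prob = 1.0;             /* same transition probability */
        out->mu = 0.0;
        out->sigma = 0.0;
        for (int i = 0; i < n; i++) {      /* mean = Σ mean_i, std = Σ std_i */
            out->mu    += chain[i].mu;
            out->sigma += chain[i].sigma;
        }
        return 0;
    }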
Fault Injection Types
We inject faults into the NAS Parallel Benchmarks: BT, SP, CG, FT, LU. Injections occur at a random {MPI call, task} pair, on the Linux Sierra cluster at LLNL (six-core nodes, 2.8 GHz), for a total of 960 experiments.

    Type            Description
    CPU_INTENSIVE   CPU-intensive code region (a triply nested loop)
    MEM_INTENSIVE   Memory-intensive code filling a GB-scale buffer at random locations
    HANG            Local deadlock: the process suspends execution indefinitely
    TRANS_STALL     Transient stall: the process suspends execution for a few seconds

Evaluation Metric: Task-Isolation Recall
Recall is the fraction of runs in which the faulty task (where the fault was injected) appears among the top-k abnormal processes reported. Example with three runs: if the injected task appears in the reported abnormal tasks for two of the three runs, task-isolation recall = 2/3 ≈ 0.67.
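As a trivial worked form of this metric (illustrative only):

    /* Recall over n_runs experiments: hit[r] is nonzero when the
       injected task appeared in run r's top-k abnormal list. */
    double task_isolation_recall(const int *hit, int n_runs) {
        int hits = 0;
        for (int r = 0; r < n_runs; r++)
            if (hit[r]) hits++;
        return (double)hits / n_runs;   /* e.g., 2 of 3 runs -> 0.67 */
    }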
Graph Compression Results
(Figure: NN and clustering task-isolation recall for BT, SP, CG, FT, LU, and MG under the CPU_INTENSIVE, MEM_INTENSIVE, HANG, and TRANS_STALL injections, comparing the baseline without compression against compression.)
- Compression improves task-isolation recall: dimensionality reduction effectively helps both clustering and NN. For example, for BT the NN recall improves to 100% with compression.
- NN seems better, but note that injections occur in only one task.

Performance Results: Overhead Reduction Using the Pre-Computed Lk-Norm
(Figure: log time to compare two SMMs vs. SMM size in edges, baseline vs. pre-computed Lk-norm.)
Using the pre-computed table to estimate the Lk-norm reduces the overhead of comparing SMM graphs; the pre-computed method is 6x faster than the baseline.
Performance Experiments at Scale
- We use the Algebraic Multigrid benchmark (AMG 2006), a scalable multigrid solver that has been demonstrated at very large scale on BlueGene/L.
- We ran it with thousands of tasks on the LLNL Sierra Linux cluster and measured the time of edge/task isolation and of graph compression.

Performance Results: Time to Perform the Analysis at Large Scale
(Figure: analysis time vs. number of tasks for the NN and clustering methods, baseline without compression vs. compression, with separate curves for task isolation, graph compression, and edge isolation.)
- Compression doesn't incur much overhead: most of its computation is performed locally.
- The entire analysis is performed in seconds at thousands of tasks.
- Analysis time scales logarithmically with the number of tasks.
Concluding Remarks
Contributions:
- A scalable technique to detect faults in MPI applications; the implementation scales easily to thousands of tasks.
- Compressing the task graph improves anomaly-detection accuracy.
Future work:
- Extend the compression technique to allow finer-grained instrumentation of function calls.
- Capture more information to detect a wider range of faults.

Bring us your fault / bug at large scale, whether a performance anomaly or a coding bug, and we'll be happy to try AutomaDeD on it. Ignacio Laguna <ilaguna@purdue.edu>
Backup Slides

Edge Isolation Results
(Figure: NN and clustering edge-isolation recall for BT, SP, CG, FT, LU, and MG under the four injection types.)
- Edge-isolation recall is high for both clustering and NN.
- This result assumes the faulty task has already been correctly identified.
Trend Lines for Analysis Time
(Figure: analysis time vs. number of tasks with fitted logarithmic trend lines of the form f(x) = a·ln(x) + b, with slopes of roughly 0.79 and 0.8 for the clustering and NN methods.)
The logarithmic curves model the data well; extrapolating the fits, the full analysis still completes on the order of ten seconds at hundred-thousand-task scales.