Improving the Scalability of Comparative Debugging with MRNet

Improving the Scalability of Comparative Debugging with MRNet
Jin Chao, MeSsAGE Lab (Monash Uni.) and Cray Inc.
Authors: David Abramson, Minh Ngoc Dinh, Jin Chao (Monash University); Luiz DeRose, Robert Moench, Andrew Gontarek (Cray Inc.)

Outline
- Assertion-based comparative debugging
- The architecture of Guard
- Improving the scalability of Guard with MRNet
- Performance evaluation
- Conclusion and future work

Assertion-based Comparative Debugging

Challenges faced by parallel debugging (1): the cognitive challenge
- Large data sets on popular supercomputers, e.g. Blue Waters (Cray): > 1.5 PB aggregate memory, > 380,000 CPU cores
- Large sets of scientific data that are not human-readable: multi-dimensional, floating-point
- What is the correct state?

Challenges faced by parallel debugging (2)
- Control-flow based parallel debuggers
- Limitations of visualization tools: errors in visualized presentations are hard to detect

Comparative debugging
Data-centric debugging: focusing on the data sets in parallel programs.
- Comparing state between programs: porting codes across different platforms; re-writing codes in different languages; software evolution (modifying/improving existing codes)
- Comparing state with users' expectations: invariants based on the properties of scientific modelling or mathematical theories; verifying the correctness of the computed state

Assertions for comparative debugging
An assertion is a statement about an intended state of a system's component. An assertion in Guard is an ad hoc debug-time assertion.

Simple assertion example. Incorrect code:
31: for (j = 0; j < n; j++) {
32:   for (i = 0; i < m; i++) {
33:     p[j][j] = 5000; }}
Correct code:
31: for (j = 0; j < n; j++) {
32:   for (i = 0; i < m; i++) {
33:     p[j][i] = 5000; }}
The assertion that catches the difference:
assert $a::p@"source.c":33 = 5000

Assertions for comparative debugging (continued)
- Simple assertion: assert $a::var@"source.c":34 = 1024
- Comparative assertion: observes the divergence in key data structures as the programs execute, e.g. assert $a::p_array@"prog1.c":34 = $b::p_array@"prog2.c":37
- Statistical assertion: verifies the statistical properties of scientific modelling or mathematical theories, e.g. standard deviation, histogram

Example of a statistical assertion: histogram
A histogram is a user-defined abstract data model, created with a two-phase operation:
create $model = randset(Gaussian, 100000, 0.05)
set reduce histogram(1000, 0.0, 1.0)
assert $a::my_array@"code.c":10 ~ $model < 0.02
The estimate operator ~ applies a χ² (chi-squared) goodness-of-fit test. A sketch of how such an assertion could be evaluated follows.
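
As an illustration only, here is a minimal sketch, under the assumption of equal-width bins and per-bin expected counts supplied by the model, of how such a two-phase histogram assertion could be evaluated; the helper functions are hypothetical and not Guard's implementation. Each debug server bins its local slice, the per-bin counts add element-wise across the tree, and the front-end computes the χ² statistic against the model.

// Sketch of a two-phase histogram assertion (hypothetical helpers).
#include <vector>
#include <cmath>
#include <cstddef>

// Phase 1 (per back-end): histogram of a local slice into `bins` equal-width bins on [lo, hi).
std::vector<double> local_histogram(const std::vector<double>& data,
                                    int bins, double lo, double hi) {
    std::vector<double> counts(bins, 0.0);
    for (double x : data) {
        int b = static_cast<int>((x - lo) / (hi - lo) * bins);
        if (b >= 0 && b < bins) counts[b] += 1.0;
    }
    return counts;
}

// Phase 2 (front-end): chi-squared statistic between the globally summed
// observed counts and the expected counts derived from the reference model.
double chi_squared(const std::vector<double>& observed,
                   const std::vector<double>& expected) {
    double chi2 = 0.0;
    for (std::size_t i = 0; i < observed.size(); ++i)
        if (expected[i] > 0.0)
            chi2 += (observed[i] - expected[i]) * (observed[i] - expected[i]) / expected[i];
    return chi2;   // the derived goodness measure is then compared with the assertion's threshold
}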

Implementation of assertions
An assertion is compiled into a dataflow graph and executed by a dataflow machine. For example:
set reduce sum
assert $a::p_array@"prog.c":34 > 1,000,000
compiles into the graph: PROCSET($a) -> Set BKPT -> Go -> BKPT Hit -> Get VAR -> Reduce Sum(p_array) -> Compare with 1,000,000 -> Exit

Dataflow graphs

The Architecture of Guard

Features of Guard (CCDB)
- A general parallel debugger supporting C, Fortran and MPI
- A relative debugger with comparative assertions and statistical assertions
- Client/server structure: a machine-independent command line client; Visual Studio client for Windows; Sun ONE Studio and IBM Eclipse
- Supports servers on different architectures: Unix: Sun (Solaris), x86 (Linux), IBM RS/6000 (AIX); Windows: Visual Studio .NET
- Architecture Independent Format (AIF)

The architecture of Guard (diagram)
Front-end: CLI, relative debugger, debug client. Network: sockets connecting the front-end to the debug servers s_0 ... s_n. Back-end: one GDB per process p_0 ... p_n.

Features of MRNet
- Tree-Based Overlay Networks (TBONs)
- Scalable broadcast and gather
- Custom data aggregation

General purpose API of MRNet
- User-defined tree topology; topology file: k-ary, k-nomial and tailored layouts
- Communicator: a set of back-ends
- Stream: a logical data channel over a communicator; multicast, gather and custom reduction
- Packet: a collection of data
- Filter: modifies the data transferred across it; WaitForAll, TimeOut, NoWait synchronization
- Startup: launching back-end processes
A minimal front-end sketch using this API follows.
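
The following sketch is based on MRNet's published 3.x front-end API; the back-end binary "./tool_be", the topology file name and the tag value are placeholders, not part of this talk. It builds the tree from a topology file, opens a stream with a built-in SUM transformation filter and WaitForAll synchronization over all back-ends, broadcasts one request and receives the reduced reply.

// Minimal MRNet front-end sketch (assumed file names and back-end binary).
//
// Example contents of topology.txt (fan-out 2 here; the evaluation in this
// talk used a balanced tree with fan-out 64):
//   fe_host:0 =>
//     be_host1:0
//     be_host2:0 ;
#include "mrnet/MRNet.h"
#include <cstdio>
using namespace MRN;

int main() {
    const char* topology_file = "topology.txt";
    const char* backend_exe   = "./tool_be";      // hypothetical back-end binary
    const char* be_argv[]     = { NULL };

    Network* net = Network::CreateNetworkFE(topology_file, backend_exe, be_argv);
    if (net == NULL || net->has_Error())
        return 1;

    Communicator* comm = net->get_BroadcastCommunicator();   // all back-ends
    Stream* stream = net->new_Stream(comm, TFILTER_SUM, SFILTER_WAITFORALL);

    int tag = FirstApplicationTag;                // application tags start here
    stream->send(tag, "%d", 1);                   // ask every back-end to report "1"
    stream->flush();

    int sum = 0;
    PacketPtr pkt;
    if (stream->recv(&tag, pkt) != -1)            // blocks until the full reduction arrives
        pkt->unpack("%d", &sum);
    printf("number of back-ends (via SUM filter): %d\n", sum);

    delete net;                                   // tears down the overlay tree
    return 0;
}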

Improving the scalability of Guard with MRNet

The architecture of Guard with MRNet (diagram)
Front-end: command line interface, comparative debugger, debug client, C wrapper, MRNet front-end. Communication processes (CPs) running the Guard filter form the internal tree. Back-end: debug servers s_0 ... s_n, each an MRNet BE driving one GDB per process p_0 ... p_n.

Communication with MRNet: communication patterns of Guard as a general parallel debugger
- Topology: balanced, k-nomial; placement of MRNet internal nodes requires no additional resources
- procset maps to a Communicator; channels for commands, events and I/O
- Command stream: synchronous channel; event and I/O streams: asynchronous channels
- Aggregating redundant messages: WaitForAll filter on the synchronous channel, TimeOut filter on the asynchronous channels (a sketch of creating both kinds of streams follows)
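
A minimal sketch of the two channel types described above, assuming a Network* created as in the previous sketch; the function name is hypothetical and this is not Guard's actual code. Both streams span the same procset (broadcast communicator); only the synchronization filter differs.

#include "mrnet/MRNet.h"
using namespace MRN;

void create_guard_streams(Network* net, Stream*& cmd_stream, Stream*& event_stream) {
    Communicator* procset = net->get_BroadcastCommunicator();

    // Command stream (synchronous): an aggregated packet moves upstream only
    // after every debug server has answered, matching the WaitForAll filter.
    cmd_stream = net->new_Stream(procset, TFILTER_NULL, SFILTER_WAITFORALL);

    // Event / I/O stream (asynchronous): whatever has arrived when the timer
    // expires is aggregated and forwarded, so slow servers never stall output.
    event_stream = net->new_Stream(procset, TFILTER_NULL, SFILTER_TIMEOUT);
}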

Creating a communication tree of MRNet (diagram)
Invoking a debug session on a Cray: the Guard client starts the MRNet front-end; the application processes p_0 ... p_n are launched through aprun/apinit; a launch helper agent starts the debug servers s_0 ... s_n, each attaching GDB to its local processes.

Comparative assertion with MRNet (1) (diagram)
assert $a::p_array@"prog1.c":34 = $b::p_array@"prog2.c":37
The Guard client drives two MRNet trees (FE 0 and FE 1), one per program, each with its own communication processes (CPs) and back-end debug servers s_0 ... s_n.

Comparative assertion with MRNet (2) (diagram)
Centralized comparison: each program's data slices ds_0 ... ds_3 are gathered through its MRNet tree, and the front-end reconstructs the full arrays and compares them.

Data reconstruction
Regular decompositions are currently supported. Assertions require the debugger to understand the data decomposition, which is described with a blockmap:
blockmap test(p::v)
  define distribute(block, *)
  define data(1024, 1024)
end
(Diagram: p_array and s_array distributed by block over processes P1 ... P4.)
A sketch of the index arithmetic such a blockmap implies follows.
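
As an illustration only (a hypothetical helper, not Guard's reconstruction code), here is the index arithmetic implied by a (block, *) distribution of the first dimension: which process owns a given global row, and at which local offset, which is what the front-end needs in order to reassemble gathered slices.

// Sketch: owner/offset computation for a (block, *) distribution of a
// data(NROWS, NCOLS) array over nprocs processes.  Rows are split into
// contiguous blocks; the second dimension is not distributed.
#include <utility>

struct BlockMap {
    int nrows;    // global number of rows, e.g. 1024
    int nprocs;   // number of processes holding the array

    int block_size() const {                       // ceiling division
        return (nrows + nprocs - 1) / nprocs;
    }
    // Global row -> (owning process, local row index).
    std::pair<int, int> owner(int global_row) const {
        int b = block_size();
        return std::make_pair(global_row / b, global_row % b);
    }
    // (process, local row) -> global row, used when reassembling slices.
    int global_row(int proc, int local_row) const {
        return proc * block_size() + local_row;
    }
};

For data(1024, 1024) spread over four processes each block holds 256 rows, so BlockMap{1024, 4}.owner(700) yields process 2, local row 188.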

Comparative assertion with MRNet (3) (diagram)
Point-to-point comparison: matching data slices d_0 ... d_3 are compared directly at the back-ends, and only the comparison results r_0 ... r_3 travel up the MRNet trees to the front-end (a sketch of such a per-slice check follows).
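
A minimal sketch (a hypothetical helper, not Guard's code) of the kind of per-slice check a back-end could run in the point-to-point case; the mismatch counts it returns are small integers that an in-tree SUM reduction can aggregate, so the front-end only sees totals rather than raw data.

// Sketch: compare two matching data slices element-wise with a relative
// tolerance and return the number of mismatches.  The per-server counts can
// then be reduced (e.g. with TFILTER_SUM) on the way to the front-end.
#include <cmath>
#include <cstddef>

std::size_t count_mismatches(const double* a, const double* b,
                             std::size_t n, double rel_tol) {
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < n; ++i) {
        double scale = std::fmax(std::fabs(a[i]), std::fabs(b[i]));
        if (std::fabs(a[i] - b[i]) > rel_tol * scale)
            ++mismatches;
    }
    return mismatches;
}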

Statistical assertion with MRNet (diagram)
Standard deviation as a two-phase operation: in the parallel phase each debug server s_i calculates a set of primary statistics ps_i; in the aggregation phase the front-end (Guard client) combines them into the full statistical model. A sketch of the two-phase standard deviation follows.
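
As an illustration of the two-phase idea (not Guard's actual filter code), each back-end can compute the count, sum and sum of squares of its local slice; these partial statistics are additive, so they can be combined pairwise inside the tree or by a custom filter, and the front-end derives the global standard deviation from the totals.

// Sketch: two-phase standard deviation.
#include <cmath>
#include <cstddef>

struct PartialStats { double count, sum, sum_sq; };   // primary statistics ps_i

PartialStats local_stats(const double* data, std::size_t n) {   // phase 1, per server
    PartialStats s = { 0.0, 0.0, 0.0 };
    for (std::size_t i = 0; i < n; ++i) {
        s.count  += 1.0;
        s.sum    += data[i];
        s.sum_sq += data[i] * data[i];
    }
    return s;
}

PartialStats merge(const PartialStats& a, const PartialStats& b) {   // in-tree aggregation
    PartialStats m = { a.count + b.count, a.sum + b.sum, a.sum_sq + b.sum_sq };
    return m;
}

double std_dev(const PartialStats& t) {                          // phase 2, front-end
    double mean = t.sum / t.count;
    return std::sqrt(t.sum_sq / t.count - mean * mean);          // population form
}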

Performance Evaluation

Performance evaluation: setup
- MRNet (3.1.0) configuration: a balanced topology with a fan-out of 64
- Test bed: Hera, a Cray XE6 system with the Gemini 1.2 interconnect; 752 compute nodes, each with 32 GB of memory; 21,760 CPU cores in total
- General parallel debugger: SP (Scalar Penta-diagonal) from the NAS Parallel Benchmarks (NPB) 3.3.1
- Relative debugger: comparative assertions on data examples from WRF; statistical assertions on a molecular dynamics simulation

Scalability of the invoke command (plot)
Time (seconds) vs. the number of parallel processes (up to 2 x 10^4), showing Total-MRNet, MRNet Instantiation, MRNet Attachment and MRNet Overall.

Latency of debugging commands (plot)
Time (seconds, log scale) vs. the number of parallel processes (up to 16,000) for the "all bkpt (MRNet)" and "all step (MRNet)" commands.

Message aggregation (plot)
A 256-byte memory buffer was added to the SP program and its value was inspected under different degrees of aggregation (DoA). Time (seconds, log scale) vs. the number of parallel processes (up to 16,000) for DoA = 1, 0.1 and 0.01.

Comparative assertion (plot)
320 KB per process x 10,000 processes, roughly 3 GB in total. The data sizes come from WRF, a production climate model.

Statistical assertion (plot)
Molecular dynamics simulation with 209 x 10^6 atoms: the atomistic mechanism of fracture accompanying structural phase transformation in AlN ceramic under hypervelocity impact. Time (seconds, log scale) vs. the number of parallel processes (up to 12,000) for the overall and reduction times of the histogram and standard deviation assertions.

Conclusion and future work
- MRNet improves the scalability of Guard to more than 20,000 debug servers
- The overhead of creating an MRNet tree: attachment; instantiation (one BE process per core)
- How to take advantage of the computing capability of MRNet: comparison across multiple trees? building a tree across multiple aprun launches? programming filters to handle the aggregations of statistical assertions?
- The best way of maintaining the C wrapper for the C++ interface?

Related publications
- Dinh M.N., Abramson D., Jin C., Gontarek A., Moench R., and DeRose L., "Debugging Scientific Applications with Statistical Assertions," International Conference on Computational Science (ICCS), Omaha, USA, 2012.
- Dinh M.N., Abramson D., Jin C., Gontarek A., Moench R., and DeRose L., "Scalable Parallel Debugging with Statistical Assertions," ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) 2012, New Orleans, USA (poster session).
- Jin C., Abramson D., Dinh M.N., Gontarek A., Moench R., and DeRose L., "A Scalable Parallel Debugging Library with Pluggable Communication Protocols," 12th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid 2012), Ottawa, Canada, May 2012.
- Dinh M.N., Abramson D., Kurniawan D., Jin C., Moench R., and DeRose L., "Assertion Based Parallel Debugging," 11th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid 2011), Newport Beach, USA, May 2011.
- Abramson D., Dinh M.N., Kurniawan D., Moench B., and DeRose L., "Data Centric Highly Parallel Debugging," 2010 International Symposium on High Performance Distributed Computing (HPDC 2010), Chicago, USA, June 2010.
- Abramson D., Foster I., Michalakes J., and Sosic R., "Relative Debugging and its Application to the Development of Large Numerical Models," Proceedings of IEEE Supercomputing 1995, San Diego, December 1995 (best paper award).

Thanks and Questions?