Improving the Scalability of Comparative Debugging with MRNet

Improving the Scalability of Comparative Debugging with MRNet
Jin Chao, MeSsAGE Lab (Monash Uni.) and Cray Inc.
Authors: David Abramson, Minh Ngoc Dinh, Jin Chao (Monash University); Luiz DeRose, Robert Moench, Andrew Gontarek (Cray Inc.)

Outline
- Assertion-based comparative debugging
- The architecture of Guard
- Improving the scalability of Guard with MRNet
- Performance evaluation
- Conclusion and future work

Assertion-based Comparative Debugging

Challenges faced by parallel debugging (1): the cognitive challenge
- Large data sets on popular supercomputers, e.g. Blue Waters (Cray): > 1.5 PB aggregate memory, > 380,000 CPU cores
- Large sets of scientific data that are not human-readable: multi-dimensional, floating-point
- What is the correct state?

Challenges faced by parallel debugging (2)
- Control-flow based parallel debuggers
- Limitations of visualization tools: errors in visualized presentations are hard to detect

Comparative debugging
Data-centric debugging: focusing on the data sets in parallel programs.
- Comparing state between programs: porting codes across different platforms; re-writing codes in different languages; software evolution (modifying/improving existing codes)
- Comparing state with users' expectations: invariants based on the properties of scientific modelling or mathematical theories; verifying the correctness of the computed state

Assertions for comparative debugging
An assertion is a statement about an intended state of a system's component. An assertion in Guard is an ad hoc debug-time assertion.

Simple assertion example. Incorrect code:
31: for (j = 0; j < n; j++) {
32:   for (i = 0; i < m; i++) {
33:     p[j][j] = 5000; }}
Correct code:
31: for (j = 0; j < n; j++) {
32:   for (i = 0; i < m; i++) {
33:     p[j][i] = 5000; }}
The assertion that catches the difference:
assert $a::p@"source.c":33 = 5000

Assertions for comparative debugging (continued)
- Simple assertion: assert $a::var@"source.c":34 = 1024
- Comparative assertion: observes the divergence in key data structures as the programs execute, e.g. assert $a::p_array@"prog1.c":34 = $b::p_array@"prog2.c":37
- Statistical assertion: verifies the statistical properties of scientific modelling or mathematical theories, e.g. standard deviation, histogram

Example of a statistical assertion: histogram
A histogram is a user-defined abstract data model, created with a two-phase operation:
create $model = randset(Gaussian, 100000, 0.05)
set reduce histogram(1000, 0.0, 1.0)
assert $a::my_array@"code.c":10 ~ $model < 0.02
The estimate operator ~ applies a χ² (chi-squared) goodness-of-fit test. A sketch of how such an assertion could be evaluated follows.
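
As an illustration only, here is a minimal sketch, under the assumption of equal-width bins and per-bin expected counts supplied by the model, of how such a two-phase histogram assertion could be evaluated; the helper functions are hypothetical and not Guard's implementation. Each debug server bins its local slice, the per-bin counts add element-wise across the tree, and the front-end computes the χ² statistic against the model.

// Sketch of a two-phase histogram assertion (hypothetical helpers).
#include <vector>
#include <cmath>
#include <cstddef>

// Phase 1 (per back-end): histogram of a local slice into `bins` equal-width bins on [lo, hi).
std::vector<double> local_histogram(const std::vector<double>& data,
                                    int bins, double lo, double hi) {
    std::vector<double> counts(bins, 0.0);
    for (double x : data) {
        int b = static_cast<int>((x - lo) / (hi - lo) * bins);
        if (b >= 0 && b < bins) counts[b] += 1.0;
    }
    return counts;
}

// Phase 2 (front-end): chi-squared statistic between the globally summed
// observed counts and the expected counts derived from the reference model.
double chi_squared(const std::vector<double>& observed,
                   const std::vector<double>& expected) {
    double chi2 = 0.0;
    for (std::size_t i = 0; i < observed.size(); ++i)
        if (expected[i] > 0.0)
            chi2 += (observed[i] - expected[i]) * (observed[i] - expected[i]) / expected[i];
    return chi2;   // the derived goodness measure is then compared with the assertion's threshold
}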

Implementation of assertions
An assertion is compiled into a dataflow graph and executed by a dataflow machine. For example:
set reduce sum
assert $a::p_array@"prog.c":34 > 1,000,000
compiles into the graph: PROCSET($a) -> Set BKPT -> Go -> BKPT Hit -> Get VAR -> Reduce Sum(p_array) -> Compare with 1,000,000 -> Exit

Dataflow graphs

The Architecture of Guard

Features of Guard (CCDB)
- A general parallel debugger supporting C, Fortran and MPI
- A relative debugger with comparative assertions and statistical assertions
- Client/server structure: a machine-independent command line client; Visual Studio client for Windows; Sun ONE Studio and IBM Eclipse
- Supports servers on different architectures: Unix: Sun (Solaris), x86 (Linux), IBM RS/6000 (AIX); Windows: Visual Studio .NET
- Architecture Independent Format (AIF)

The architecture of Guard (diagram)
Front-end: CLI, relative debugger, debug client. Network: sockets connecting the front-end to the debug servers s_0 ... s_n. Back-end: one GDB per process p_0 ... p_n.

Features of MRNet
- Tree-Based Overlay Networks (TBONs)
- Scalable broadcast and gather
- Custom data aggregation

General purpose API of MRNet
- User-defined tree topology; topology file: k-ary, k-nomial and tailored layouts
- Communicator: a set of back-ends
- Stream: a logical data channel over a communicator; multicast, gather and custom reduction
- Packet: a collection of data
- Filter: modifies the data transferred across it; WaitForAll, TimeOut, NoWait synchronization
- Startup: launching back-end processes
A minimal front-end sketch using this API follows.
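
The following sketch is based on MRNet's published 3.x front-end API; the back-end binary "./tool_be", the topology file name and the tag value are placeholders, not part of this talk. It builds the tree from a topology file, opens a stream with a built-in SUM transformation filter and WaitForAll synchronization over all back-ends, broadcasts one request and receives the reduced reply.

// Minimal MRNet front-end sketch (assumed file names and back-end binary).
//
// Example contents of topology.txt (fan-out 2 here; the evaluation in this
// talk used a balanced tree with fan-out 64):
//   fe_host:0 =>
//     be_host1:0
//     be_host2:0 ;
#include "mrnet/MRNet.h"
#include <cstdio>
using namespace MRN;

int main() {
    const char* topology_file = "topology.txt";
    const char* backend_exe   = "./tool_be";      // hypothetical back-end binary
    const char* be_argv[]     = { NULL };

    Network* net = Network::CreateNetworkFE(topology_file, backend_exe, be_argv);
    if (net == NULL || net->has_Error())
        return 1;

    Communicator* comm = net->get_BroadcastCommunicator();   // all back-ends
    Stream* stream = net->new_Stream(comm, TFILTER_SUM, SFILTER_WAITFORALL);

    int tag = FirstApplicationTag;                // application tags start here
    stream->send(tag, "%d", 1);                   // ask every back-end to report "1"
    stream->flush();

    int sum = 0;
    PacketPtr pkt;
    if (stream->recv(&tag, pkt) != -1)            // blocks until the full reduction arrives
        pkt->unpack("%d", &sum);
    printf("number of back-ends (via SUM filter): %d\n", sum);

    delete net;                                   // tears down the overlay tree
    return 0;
}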

Improving the scalability of Guard with MRNet

The architecture of Guard with MRNet (diagram)
Front-end: command line interface, comparative debugger, debug client, C wrapper, MRNet front-end. Communication processes (CPs) running the Guard filter form the internal tree. Back-end: debug servers s_0 ... s_n, each an MRNet BE driving one GDB per process p_0 ... p_n.

Communication with MRNet: communication patterns of Guard as a general parallel debugger
- Topology: balanced, k-nomial; placement of MRNet internal nodes requires no additional resources
- procset maps to a Communicator; channels for commands, events and I/O
- Command stream: synchronous channel; event and I/O streams: asynchronous channels
- Aggregating redundant messages: WaitForAll filter on the synchronous channel, TimeOut filter on the asynchronous channels (a sketch of creating both kinds of streams follows)
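
A minimal sketch of the two channel types described above, assuming a Network* created as in the previous sketch; the function name is hypothetical and this is not Guard's actual code. Both streams span the same procset (broadcast communicator); only the synchronization filter differs.

#include "mrnet/MRNet.h"
using namespace MRN;

void create_guard_streams(Network* net, Stream*& cmd_stream, Stream*& event_stream) {
    Communicator* procset = net->get_BroadcastCommunicator();

    // Command stream (synchronous): an aggregated packet moves upstream only
    // after every debug server has answered, matching the WaitForAll filter.
    cmd_stream = net->new_Stream(procset, TFILTER_NULL, SFILTER_WAITFORALL);

    // Event / I/O stream (asynchronous): whatever has arrived when the timer
    // expires is aggregated and forwarded, so slow servers never stall output.
    event_stream = net->new_Stream(procset, TFILTER_NULL, SFILTER_TIMEOUT);
}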

Creating a communication tree of MRNet (diagram)
Invoking a debug session on a Cray: the Guard client starts the MRNet front-end; the application processes p_0 ... p_n are launched through aprun/apinit; a launch helper agent starts the debug servers s_0 ... s_n, each attaching GDB to its local processes.

Comparative assertion with MRNet (1) (diagram)
assert $a::p_array@"prog1.c":34 = $b::p_array@"prog2.c":37
The Guard client drives two MRNet trees (FE 0 and FE 1), one per program, each with its own communication processes (CPs) and back-end debug servers s_0 ... s_n.

Comparative assertion with MRNet (2) (diagram)
Centralized comparison: each program's data slices ds_0 ... ds_3 are gathered through its MRNet tree, and the front-end reconstructs the full arrays and compares them.

Data reconstruction
Regular decompositions are currently supported. Assertions require the debugger to understand the data decomposition, which is described with a blockmap:
blockmap test(p::v)
  define distribute(block, *)
  define data(1024, 1024)
end
(Diagram: p_array and s_array distributed by block over processes P1 ... P4.)
A sketch of the index arithmetic such a blockmap implies follows.
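
As an illustration only (a hypothetical helper, not Guard's reconstruction code), here is the index arithmetic implied by a (block, *) distribution of the first dimension: which process owns a given global row, and at which local offset, which is what the front-end needs in order to reassemble gathered slices.

// Sketch: owner/offset computation for a (block, *) distribution of a
// data(NROWS, NCOLS) array over nprocs processes.  Rows are split into
// contiguous blocks; the second dimension is not distributed.
#include <utility>

struct BlockMap {
    int nrows;    // global number of rows, e.g. 1024
    int nprocs;   // number of processes holding the array

    int block_size() const {                       // ceiling division
        return (nrows + nprocs - 1) / nprocs;
    }
    // Global row -> (owning process, local row index).
    std::pair<int, int> owner(int global_row) const {
        int b = block_size();
        return std::make_pair(global_row / b, global_row % b);
    }
    // (process, local row) -> global row, used when reassembling slices.
    int global_row(int proc, int local_row) const {
        return proc * block_size() + local_row;
    }
};

For data(1024, 1024) spread over four processes each block holds 256 rows, so BlockMap{1024, 4}.owner(700) yields process 2, local row 188.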

Comparative assertion with MRNet (3) (diagram)
Point-to-point comparison: matching data slices d_0 ... d_3 are compared directly at the back-ends, and only the comparison results r_0 ... r_3 travel up the MRNet trees to the front-end (a sketch of such a per-slice check follows).
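
A minimal sketch (a hypothetical helper, not Guard's code) of the kind of per-slice check a back-end could run in the point-to-point case; the mismatch counts it returns are small integers that an in-tree SUM reduction can aggregate, so the front-end only sees totals rather than raw data.

// Sketch: compare two matching data slices element-wise with a relative
// tolerance and return the number of mismatches.  The per-server counts can
// then be reduced (e.g. with TFILTER_SUM) on the way to the front-end.
#include <cmath>
#include <cstddef>

std::size_t count_mismatches(const double* a, const double* b,
                             std::size_t n, double rel_tol) {
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < n; ++i) {
        double scale = std::fmax(std::fabs(a[i]), std::fabs(b[i]));
        if (std::fabs(a[i] - b[i]) > rel_tol * scale)
            ++mismatches;
    }
    return mismatches;
}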

Statistical assertion with MRNet (diagram)
Standard deviation as a two-phase operation: in the parallel phase each debug server s_i calculates a set of primary statistics ps_i; in the aggregation phase the front-end (Guard client) combines them into the full statistical model. A sketch of the two-phase standard deviation follows.
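
As an illustration of the two-phase idea (not Guard's actual filter code), each back-end can compute the count, sum and sum of squares of its local slice; these partial statistics are additive, so they can be combined pairwise inside the tree or by a custom filter, and the front-end derives the global standard deviation from the totals.

// Sketch: two-phase standard deviation.
#include <cmath>
#include <cstddef>

struct PartialStats { double count, sum, sum_sq; };   // primary statistics ps_i

PartialStats local_stats(const double* data, std::size_t n) {   // phase 1, per server
    PartialStats s = { 0.0, 0.0, 0.0 };
    for (std::size_t i = 0; i < n; ++i) {
        s.count  += 1.0;
        s.sum    += data[i];
        s.sum_sq += data[i] * data[i];
    }
    return s;
}

PartialStats merge(const PartialStats& a, const PartialStats& b) {   // in-tree aggregation
    PartialStats m = { a.count + b.count, a.sum + b.sum, a.sum_sq + b.sum_sq };
    return m;
}

double std_dev(const PartialStats& t) {                          // phase 2, front-end
    double mean = t.sum / t.count;
    return std::sqrt(t.sum_sq / t.count - mean * mean);          // population form
}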

Performance Evaluation

Performance evaluation: setup
- MRNet (3.1.0) configuration: a balanced topology with a fan-out of 64
- Test bed: Hera, a Cray XE6 system with the Gemini 1.2 interconnect; 752 compute nodes, each with 32 GB of memory; 21,760 CPU cores in total
- General parallel debugger: SP (Scalar Penta-diagonal) from the NAS Parallel Benchmarks (NPB) 3.3.1
- Relative debugger: comparative assertions on data examples from WRF; statistical assertions on a molecular dynamics simulation

Scalability of the invoke command (plot)
Time (seconds) vs. the number of parallel processes (up to 2 x 10^4), showing Total-MRNet, MRNet Instantiation, MRNet Attachment and MRNet Overall.

Latency of debugging commands (plot)
Time (seconds, log scale) vs. the number of parallel processes (up to 16,000) for the "all bkpt (MRNet)" and "all step (MRNet)" commands.

Message aggregation (plot)
A 256-byte memory buffer was added to the SP program and its value was inspected under different degrees of aggregation (DoA). Time (seconds, log scale) vs. the number of parallel processes (up to 16,000) for DoA = 1, 0.1 and 0.01.

Comparative assertion (plot)
320 KB per process x 10,000 processes, roughly 3 GB in total. The data sizes come from WRF, a production climate model.

Statistical assertion (plot)
Molecular dynamics simulation with 209 x 10^6 atoms: the atomistic mechanism of fracture accompanying structural phase transformation in AlN ceramic under hypervelocity impact. Time (seconds, log scale) vs. the number of parallel processes (up to 12,000) for the overall and reduction times of the histogram and standard deviation assertions.

Conclusion and future work
- MRNet improves the scalability of Guard to more than 20,000 debug servers
- The overhead of creating an MRNet tree: attachment; instantiation (one BE process per core)
- How to take advantage of the computing capability of MRNet: comparison across multiple trees? building a tree across multiple aprun launches? programming filters to handle the aggregations of statistical assertions?
- The best way of maintaining the C wrapper for the C++ interface?

Related publications
- Dinh M.N., Abramson D., Jin C., Gontarek A., Moench R., and DeRose L., "Debugging Scientific Applications with Statistical Assertions," International Conference on Computational Science (ICCS), Omaha, USA, 2012.
- Dinh M.N., Abramson D., Jin C., Gontarek A., Moench R., and DeRose L., "Scalable Parallel Debugging with Statistical Assertions," ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) 2012, New Orleans, USA (poster session).
- Jin C., Abramson D., Dinh M.N., Gontarek A., Moench R., and DeRose L., "A Scalable Parallel Debugging Library with Pluggable Communication Protocols," 12th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid 2012), Ottawa, Canada, May 2012.
- Dinh M.N., Abramson D., Kurniawan D., Jin C., Moench R., and DeRose L., "Assertion Based Parallel Debugging," 11th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid 2011), Newport Beach, USA, May 2011.
- Abramson D., Dinh M.N., Kurniawan D., Moench B., and DeRose L., "Data Centric Highly Parallel Debugging," 2010 International Symposium on High Performance Distributed Computing (HPDC 2010), Chicago, USA, June 2010.
- Abramson D., Foster I., Michalakes J., and Sosic R., "Relative Debugging and its Application to the Development of Large Numerical Models," Proceedings of IEEE Supercomputing 1995, San Diego, December 1995 (best paper award).

Thanks and Questions?