MPI Application Development with MARMOT


Slide 1: MPI Application Development with MARMOT
Bettina Krammer, University of Stuttgart, High Performance Computing Center Stuttgart (HLRS), www.hlrs.de
Matthias Müller, TU Dresden, Centre for Information Services and High Performance Computing (ZIH), www.tu-dresden.de/zih

Slide 2: Outline
- Motivation
- Approaches and tools
- Examples
- Future work and comparison with other tools
- Conclusion

Slide 3: Motivation

Slide 4: Motivation - Problems of Parallel Programming
Additional problems:
- increased complexity
- new parallel problems: deadlocks, race conditions
- irreproducibility
- portability issues: the MPI standard leaves some decisions to the implementers, e.g.
  - message buffering for send/recv operations (see the sketch below)
  - synchronising collective calls
  - implementation of opaque objects
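As an illustration of the buffering point above, here is a minimal sketch (not from the slides) whose behaviour is implementation dependent: if standard-mode MPI_Send buffers the message, both ranks proceed; if it blocks until a matching receive is posted, the two ranks deadlock. This is exactly the kind of non-portable construct a correctness tool should flag.

    /* Minimal sketch (not from the slides): both ranks send before they
     * receive. Whether this completes depends on whether the MPI library
     * buffers standard-mode sends of this size. Assumes exactly 2 ranks. */
    #include <mpi.h>
    #include <stdio.h>

    #define N 100000

    int main(int argc, char **argv)
    {
        int rank, other;
        static double out[N], in[N];   /* static: zero-initialised, off the stack */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        other = 1 - rank;

        /* Possible deadlock: both ranks sit in MPI_Send unless the
         * implementation buffers the outgoing message internally. */
        MPI_Send(out, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
        MPI_Recv(in, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d finished\n", rank);
        MPI_Finalize();
        return 0;
    }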

Slide 5: Approaches & Tools

Slide 6: (Parallel) Debuggers
- Most vendor debuggers have some support
- Debugging MPI programs with a serial debugger is hard but possible
- MPICH supports debugging with gdb attached to one process; manually attaching to the processes is also possible
- Commercial debuggers: TotalView, DDT

Slide 7: Special MPI Tools & Libraries
- MPI implementations offer (limited) support, e.g.
  - NEC Collectives Verification Library: only for NEC MPI/SX and MPI/EX
  - mpich2 profiling library for collective functions: checks correctness of collective calls and datatypes; portable library
- Tools specially dedicated to the analysis of MPI applications: MPI-Check, Umpire, IMC, MARMOT

Slide 8: MARMOT - MPI Analysis and Checking Tool
Bettina Krammer, University of Stuttgart, High Performance Computing Center Stuttgart (HLRS), www.hlrs.de

Slide 9: What is MARMOT?
- Tool for the development of MPI applications
- Automatic runtime analysis of the application:
  - detects incorrect use of MPI
  - detects non-portable constructs
  - detects possible race conditions and deadlocks
- MARMOT does not require source code modifications, just relinking
- The C and Fortran bindings of MPI-1.2 are supported, as well as C++ and mixed C/Fortran code
- Development is still ongoing (not every possible functionality is implemented yet)
- The tool makes use of the MPI profiling interface (a sketch of this mechanism follows below)

Slide 10: Design of MARMOT
Layers: application or test program -> profiling interface -> MARMOT core tool -> MPI library, with a debug server running as one additional process.
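To make the profiling-interface mechanism concrete, here is a minimal sketch (not MARMOT's actual source) of a PMPI wrapper: the tool defines its own MPI_Send, performs a few local checks, and forwards the call to the underlying library through the PMPI_Send entry point. Relinking the application against such a wrapper library is why no source code changes are needed.

    /* Minimal sketch (not MARMOT's actual code) of a PMPI wrapper. */
    #include <mpi.h>
    #include <stdio.h>

    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        /* Examples of local (client-side) checks before the real call. */
        if (count < 0)
            fprintf(stderr, "checker: negative count in MPI_Send\n");
        if (tag < 0)
            fprintf(stderr, "checker: invalid tag %d in MPI_Send\n", tag);

        /* Forward to the underlying MPI library via the profiling interface. */
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }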

Slide 11: Examples of Server Checks (verification between the nodes, control of the program)
Everything that requires a global view:
- control the execution flow; trace the MPI calls on each node throughout the whole application
- signal conditions, e.g. deadlocks (with traceback on each node)
- check matching send/receive pairs for consistency
- check collective calls for consistency
- output of a human-readable log file

Slide 12: Examples of Client Checks (verification on the local nodes)
Verification of proper construction and usage of MPI resources such as communicators, groups, datatypes etc., for example:
- verification of MPI_Request usage (see the example below):
  - invalid recycling of an active request
  - invalid use of an unregistered request
  - warning if the number of requests is zero
  - warning if all requests are MPI_REQUEST_NULL
- check for pending messages and active requests in MPI_Finalize
- verification of all other arguments such as ranks, tags, etc.
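A constructed example (not from the slides) of the request misuse these client checks target: the second MPI_Irecv below overwrites a request handle that is still active, and the first receive is never completed before MPI_Finalize, so both the "invalid recycling of an active request" and the "active requests in MPI_Finalize" checks would fire.

    /* Constructed example: invalid recycling of an active MPI_Request.
     * Run with at least 2 processes. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, a = 0, b = 0;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Irecv(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
            /* ERROR: req is still active; overwriting it loses the handle
             * of the first receive (invalid recycling of an active request). */
            MPI_Irecv(&b, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            /* The first receive is never waited for, so an active request
             * is still pending when MPI_Finalize is reached. */
        } else if (rank == 1) {
            MPI_Send(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            MPI_Send(&b, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }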

Slide 13: Availability of MARMOT
- Tested on different platforms, using different compilers (Intel, GNU, ...) and MPI implementations (mpich, LAM, vendor MPIs, ...), e.g.:
  - IA32/IA64 clusters
  - Opteron clusters
  - Xeon EM64T clusters
  - IBM Regatta
  - NEC SX5, ..., SX8
- Download and further information: http://www.hlrs.de/organization/tsc/projects/marmot/

Slide 14: Examples

Slide 15: Example - Medical Application B_Stream
- Calculation of blood flow with a 3D lattice-Boltzmann method
- 16 different MPI calls: MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Pack, MPI_Bcast, MPI_Unpack, MPI_Cart_create, MPI_Cart_shift, MPI_Cart_rank, MPI_Send, MPI_Recv, MPI_Barrier, MPI_Reduce, MPI_Sendrecv, MPI_Wtime, MPI_Finalize
- Around 6500 lines of code
- Different input files describe the geometry of the artery: tube, tube-stenosis, bifurcation

Slide 16: Example - B_Stream (blood flow simulation, tube)
- Tube geometry: simplest case, just a tube with about the same radius everywhere
- Running the application without/with MARMOT (MARMOT needs one extra process for its debug server):
    mpirun -np 3 B_Stream 500. tube
    mpirun -np 4 B_Stream_marmot 500. tube
- The application seems to run without problems

Slide 17: Example - B_Stream (blood flow simulation, tube-stenosis)
- Tube-stenosis geometry: a tube with varying radius
- Without MARMOT: mpirun -np 3 B_Stream 500. tube-stenosis
  The application seems to be hanging.
- With MARMOT: mpirun -np 4 B_Stream_marmot 500. tube-stenosis
  Deadlock found.

Slide 18: Example - B_Stream (blood flow simulation, tube-stenosis) - MARMOT trace
    9310 rank 1 performs MPI_Sendrecv
    9311 rank 2 performs MPI_Sendrecv
    9312 rank 0 performs MPI_Barrier
    9313 rank 1 performs MPI_Barrier
    9314 rank 2 performs MPI_Barrier
    9315 rank 1 performs MPI_Sendrecv
    9316 rank 2 performs MPI_Sendrecv
    9317 rank 0 performs MPI_Sendrecv
    9318 rank 1 performs MPI_Sendrecv
    9319 rank 0 performs MPI_Sendrecv
    9320 rank 2 performs MPI_Sendrecv
    9321 rank 0 performs MPI_Barrier
    9322 rank 1 performs MPI_Barrier
    9323 rank 2 performs MPI_Barrier
    9324 rank 1 performs MPI_Comm_rank
    9325 rank 1 performs MPI_Bcast
    9326 rank 2 performs MPI_Comm_rank
    9327 rank 2 performs MPI_Bcast
    9328 rank 0 performs MPI_Sendrecv
    WARNING: all clients are pending!
Slide annotations: each iteration step calculates and exchanges results with the neighbors; results are then communicated among all procs.

Slide 19: Example - B_Stream (blood flow simulation, tube-stenosis) - deadlock
Last calls on each node when the deadlock is signalled:

Node 0:
    timestamp= 9319: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)
    timestamp= 9321: MPI_Barrier(comm = MPI_COMM_WORLD)
    timestamp= 9328: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)

Node 1:
    timestamp= 9322: MPI_Barrier(comm = MPI_COMM_WORLD)
    timestamp= 9324: MPI_Comm_rank(comm = MPI_COMM_WORLD, *rank)
    timestamp= 9325: MPI_Bcast(*buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)

Node 2:
    timestamp= 9323: MPI_Barrier(comm = MPI_COMM_WORLD)
    timestamp= 9326: MPI_Comm_rank(comm = MPI_COMM_WORLD, *rank)
    timestamp= 9327: MPI_Bcast(*buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)

Slide 20: Example - B_Stream (blood flow simulation, tube-stenosis) - Code Analysis

    // (presumably inside calculate_number_of_iterations())
    if (radius < x)
        num_iter = 50;
    if (radius >= x)
        num_iter = 200;
    // ERROR: it is not ensured here that all procs do the
    // same (maximal) number of iterations

    main {
        num_iter = calculate_number_of_iterations();
        for (i = 0; i < num_iter; i++) {
            CalculateSomething();
            computebloodflow();
            // exchange results with neighbors
            MPI_Sendrecv( ... );
            MPI_Barrier( ... );
        }
        writeresults();
        // communicate results with neighbors
        MPI_Bcast( ... );
    }

Be careful if you call functions with hidden MPI calls!
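One way to repair the bug diagnosed above, given as a sketch rather than the application's actual fix (the helper agreed_iteration_count is hypothetical; calculate_number_of_iterations is taken from the slide's pseudocode): let all processes agree on a single iteration count, e.g. the global maximum, before entering the loop, so the MPI_Sendrecv and MPI_Barrier calls are issued the same number of times on every rank.

    #include <mpi.h>

    int calculate_number_of_iterations(void);   /* from the slide's pseudocode */

    /* Sketch of a fix: take the maximum of the locally computed iteration
     * counts so every rank runs the loop the same number of times. */
    int agreed_iteration_count(void)
    {
        int local = calculate_number_of_iterations();
        int global_max;
        MPI_Allreduce(&local, &global_max, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
        return global_max;
    }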

Slide 21: Example - B_Stream (blood flow simulation, bifurcation)
- Bifurcation geometry: forked artery
- Without MARMOT: mpirun -np 3 B_Stream 500. bifurcation
  Segmentation fault (it is platform dependent whether the code breaks here or not).
- With MARMOT: mpirun -np 4 B_Stream_marmot 500. bifurcation
  Problem found at the collective call MPI_Gather.

Slide 22: Example - B_Stream (blood flow simulation, bifurcation) - MARMOT log

Last calls on node 0:
    timestamp= 9327: MPI_Bcast(*buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)
    timestamp= 9330: MPI_Bcast(*buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)
    timestamp= 9333: MPI_Gather(*sendbuf, sendcount = 266409, sendtype = MPI_DOUBLE, *recvbuf, recvcount = 266409, recvtype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)

Last calls on node 1:
    timestamp= 9334: MPI_Gather(*sendbuf, sendcount = 258336, sendtype = MPI_DOUBLE, *recvbuf, recvcount = 258336, recvtype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)
    timestamp= 9336: MPI_Sendrecv(*sendbuf, sendcount = 13455, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 13455, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)
    timestamp= 9338: MPI_Sendrecv(*sendbuf, sendcount = 13455, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 13455, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)

ERROR: Root 0 has different counts than rank 1 and 2
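The error above arises because plain MPI_Gather requires every process to contribute the same count. A generic remedy, shown here as a sketch (not the application's actual code; the helper name gather_varying is hypothetical), is MPI_Gatherv with per-rank counts and displacements collected beforehand:

    /* Sketch (not the application's code): gather contributions of varying
     * size with MPI_Gatherv; plain MPI_Gather needs identical counts. */
    #include <mpi.h>
    #include <stdlib.h>

    void gather_varying(double *local, int local_count, int root, MPI_Comm comm)
    {
        int rank, size;
        int *counts = NULL, *displs = NULL;
        double *recvbuf = NULL;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* The root first learns how many elements each rank will send. */
        if (rank == root)
            counts = malloc(size * sizeof(int));
        MPI_Gather(&local_count, 1, MPI_INT, counts, 1, MPI_INT, root, comm);

        if (rank == root) {
            displs = malloc(size * sizeof(int));
            int total = 0;
            for (int i = 0; i < size; i++) { displs[i] = total; total += counts[i]; }
            recvbuf = malloc(total * sizeof(double));
        }

        /* Per-rank counts and displacements make differing sizes legal. */
        MPI_Gatherv(local, local_count, MPI_DOUBLE,
                    recvbuf, counts, displs, MPI_DOUBLE, root, comm);

        free(counts); free(displs); free(recvbuf);
    }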

Slide 23: MARMOT Performance with Real Applications

Slide 24: Air Pollution Modelling
- Air pollution modelling with the STEM-II model
- Transport equation solved with the Petrov-Crank-Nicolson-Galerkin method
- Chemistry and mass transfer are integrated using semi-implicit Euler and pseudo-analytical methods
- 15500 lines of Fortran code
- 12 different MPI calls: MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Type_extent, MPI_Type_struct, MPI_Type_commit, MPI_Type_hvector, MPI_Bcast, MPI_Scatterv, MPI_Barrier, MPI_Gatherv, MPI_Finalize

Slide 25: STEM Application on an IA32 Cluster with Myrinet
[Figure: runtime (Time [s], 0-80) versus number of processors (1-16), comparing native MPI with MARMOT.]

Slide 26: Comparison with Other Approaches and Future Directions

Slide 27: Future Direction of MARMOT
MARMOT will continue to be developed jointly by ZIH and HLRS.

Slide 28: Future Directions
- Functionality: more checks, MPI-2, OpenMP/MPI
- Usability: GUI
- Performance: scalability [figure: runtime (Time [s]) versus number of processors (1-16)]
- Combination with other tools: debugger, Vampir

Slide 29: Online vs. Offline Checking

Offline checking (Intel Message Checker):
+ history information available
+ less intrusive at runtime
- large file space needed for the trace file
- limited scalability of the analysis

Online checking (MARMOT):
- reduced performance of the running application
- only limited history/timeline information
+ no limit on the runtime of the program
+ the program is still alive when the error is detected, so further online analysis is possible

Slide 30: MPI Implementation with Checking
+ scalability and performance
+ avoids duplicating internal management tasks
+ avoids re-implementation of existing functionality
- no history/timeline information
- performance and reliability are always more important than checks
- portability problems are not a major focus

Slide 31: Comparison of Systems

                                    Online   Offline   Library
    Checks                            ++        ++        0
    Scalability and performance       +-        +-        ++
    Runtime scalability               ++        -         ++
    Timeline information              -         ++        --
    Combination with other tools      +         -         -

Slide 32: Conclusion
- Parallel programming in general, and the MPI standard in particular, contain enough pitfalls to create a need for checking/correctness/confidence tools.
- Different tools and approaches exist, each with its own advantages and disadvantages.
- MARMOT is a freely available solution that has demonstrated its usefulness with various real applications.
- A combination of tools will offer the best solution for the program developer.
- Not every error can be detected by tools.

Slide 33: Thanks for your attention