Where's the Overlap? An Analysis of Popular MPI Implementations

J. B. White III and S. W. Bova

Trey White is with Oak Ridge National Laboratory, Oak Ridge, Tennessee (trey@ccs.ornl.gov). Steve Bova is with Mississippi State University, Mississippi State, Mississippi (swb@erc.msstate.edu).

Abstract: The MPI 1.1 definition includes routines for nonblocking point-to-point communication that are intended to support the overlap of communication with computation. We describe two experiments that test the ability of MPI implementations to actually perform this overlap. One experiment tests synchronization overlap, and the other tests data-transfer overlap. We give results for vendor-supplied MPI implementations on the CRAY T3E, IBM SP, and SGI Origin2000 at the CEWES MSRC, along with results for MPICH on the T3E. All the implementations show full support for synchronization overlap. Conversely, none of them support data-transfer overlap at the level needed for significant performance improvement in our experiment. We suggest that programming for overlap may not be worthwhile for a broad class of parallel applications using many current MPI implementations.

Keywords: Message passing, MPI, nonblocking communication, overlap, parallel performance, CRAY T3E, IBM SP, SGI Origin, MPICH.

I. Introduction

Overlapping interprocessor communication with useful computation is a well-known strategy for improving the performance of parallel applications, and the Message-Passing Interface (MPI) is a popular programming environment for portable parallel applications [1]. The MPI 1.1 definition includes routines for nonblocking point-to-point communication that are intended to support the overlap of communication with computation [2]. Separate routines for starting communication and completing communication allow the passing of messages to proceed concurrently with computation. Though such overlap of communication and computation motivated the design of nonblocking communication in MPI, the MPI 1.1 standard does not require that an MPI implementation actually perform communication and computation concurrently [2].

We describe experiments that illustrate the degree to which MPI implementations support overlap and the degree to which programming for overlap can improve actual performance. We present results for popular MPI implementations on parallel systems available at the CEWES MSRC, the Department of Defense (DoD) High Performance Computing Modernization Program (HPCMP) Corps of Engineers Waterways Experiment Station (CEWES) Major Shared Resource Center (MSRC) in Vicksburg, Mississippi: the Cray Message Passing Toolkit (MPT) and MPICH on the CRAY T3E, the IBM Parallel Environment (PE) on the IBM SP, and MPT for IRIX on the SGI Origin2000.

The two experiments we describe are intended to represent two broad categories of parallel applications: asynchronous and synchronous. Asynchronous applications are those where different processes perform communication at significantly different times or rates. Examples of asynchronous applications include boss-worker, client-server, metacomputing, and other MPMD (Multiple Program, Multiple Data) applications. Conversely, synchronous applications are those where all the processes perform communication at approximately the same time and rate. SPMD (Single Program, Multiple Data) applications often qualify as synchronous. Just as parallel applications can fall into two categories, MPI implementations can support two different levels of overlap: synchronization overlap and data-transfer overlap.
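To make the nonblocking interface concrete, the following is a minimal sketch in C (not from the paper) of the pattern the standard permits: MPI_Isend starts the communication, MPI_Wait completes it, and any overlap with the intervening computation is left to the implementation. The routine names send_with_possible_overlap and compute_something are illustrative placeholders.

    #include <mpi.h>

    static void compute_something(void) { /* placeholder for application work */ }

    void send_with_possible_overlap(double *buf, int count, int dest, MPI_Comm comm)
    {
        MPI_Request request;
        MPI_Status  status;

        /* Start the send; the call returns immediately. */
        MPI_Isend(buf, count, MPI_DOUBLE, dest, 0, comm, &request);

        /* Compute while the message is (perhaps) transferred concurrently. */
        compute_something();

        /* Complete the send; buf may be reused only after this returns. */
        MPI_Wait(&request, &status);
    }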
Point-to-point message passing is explicitly two-sided: one side sends a message, and the other side receives it. If an MPI implementation can send a message without requiring the two sides to synchronize, it supports overlap of synchronization with computation. This level of overlap does not necessarily imply that the implementation performs useful computation while data are actually in transit between processes. If an implementation performs this physical transfer of data concurrently with computation, it supports overlap of data transfer with computation.

For each of these two levels of overlap, we describe an experiment used to evaluate the MPI implementations at the CEWES MSRC. We analyze the results for each experiment, describe each system and MPI implementation, and discuss the implications for asynchronous and synchronous applications. We conclude with a summary of the results and comments on the effectiveness of programming explicitly for overlap.

II. Overlap of Computation with Synchronization

MPI implementations that do not support the overlap of computation with synchronization must require the two processes involved in a message to synchronize. Figure 1 presents a schematic of an experiment to test support for synchronization overlap in MPI implementations. The experiment measures the completion time for a single message. Immediately after a mutual MPI_Barrier, the sending process issues an "immediate" send, MPI_Isend. As the name implies, the immediate send returns immediately, allowing execution to continue. This process "computes" for 4 seconds before issuing an MPI_Wait, which blocks execution until the message is actually sent. The receiving process waits for 2 seconds after the MPI_Barrier before issuing an MPI_Recv, which blocks execution until the message arrives.
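A minimal sketch in C of this two-process timing experiment is given below. The actual benchmark code is not listed in the paper; this sketch models "computing" with sleep(), as the 4-second and 2-second delays above suggest, and sends a single integer for brevity, whereas the real experiment repeats the measurement over a range of message sizes.

    #include <stdio.h>
    #include <unistd.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, value = 42;
        MPI_Request request;
        MPI_Status status;
        double t;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);

        if (rank == 0) {                      /* sending process */
            MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
            sleep(4);                         /* "compute" for 4 seconds */
            MPI_Wait(&request, &status);
        } else if (rank == 1) {               /* receiving process */
            t = MPI_Wtime();                  /* time measured from the barrier */
            sleep(2);                         /* wait 2 seconds before receiving */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            /* ~2 s => synchronization overlap; ~4 s => no overlap */
            printf("MPI_Recv completed %.1f s after the barrier\n", MPI_Wtime() - t);
        }

        MPI_Finalize();
        return 0;
    }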

[Fig. 1. An experiment to test support for synchronization overlap.]

The "result" of the experiment is the completion time of the MPI_Recv after the MPI_Barrier. For MPI implementations that support synchronization overlap, the MPI_Recv completes almost immediately, yielding a time of about 2 seconds. For MPI implementations that do not support this overlap, the two processes must synchronize. The MPI_Recv does not complete until the MPI_Wait executes, yielding a time of about 4 seconds.

Figure 2 displays the results of the experiment for a range of message sizes. The results represent the default environment on each machine; no environment variables have been modified. For all message sizes in the figure, the time for data transfer is at least an order of magnitude smaller than the multiple-second delay in the experiment. Therefore, any time for data transfer does not contribute significantly to the MPI_Recv time. The change from a 2-second to a 4-second delay indicates a change from synchronization overlap to no overlap. Each system shows a distinctive range of message sizes where overlap is supported by default. The differences in the ranges reflect differences in the MPI implementations. In addition, each system has a unique method for increasing the range of overlap well beyond the default range shown in figure 2.

[Fig. 2. Results of the synchronization-overlap experiment in the default execution environment of each system: completion time (2 seconds for overlap, 4 seconds for no overlap) versus message size from 1 to 100000 kB, for the IBM SP, the SGI Origin2000, and the CRAY T3E (MPT and MPICH).]

A. Cray MPT and MPICH on the CRAY T3E

The CRAY T3E is a multiprocessor with a single system image and distributed, globally addressable memory [3]. Both the vendor-supplied MPI and MPICH support synchronization overlap up to very large message sizes. Figure 3 models how the vendor-supplied MPI implements synchronization overlap. The CRAY T3E supports one-sided communication: each processing element can access the memory of a remote processing element without the involvement of the remote CPU [3]. Using this one-sided communication, the MPI_Isend writes a message header in memory local to the receiving process [4]. It also makes a copy of the message in a local buffer. The MPI_Recv then uses the header information and one-sided communication to read the contents of the remote buffer. For this procedure to work, the sending process must have enough free memory to allocate the buffer.

[Fig. 3. Synchronization overlap on the CRAY T3E for messages smaller than available memory.]

Figure 4 models how the vendor-supplied MPI implements message passing when the message is too large to copy into a buffer. As before, the MPI_Isend writes a header to the receiving process. The MPI_Recv blocks until the sender calls MPI_Wait, however, eliminating the opportunity for overlap. The range of message sizes allowing overlap is limited only by available buffer space, and thus only by available memory. Increasing available memory increases the range accordingly.

[Fig. 4. No overlap on the CRAY T3E for messages larger than available memory.]

It is interesting to ask why the vendor-supplied MPI bothers to buffer the outgoing message. A cynical though possibly accurate answer is that the buffering makes the MPI implementation "streams safe". Early models of the T3E have a limitation on the use of the interprocessor communication hardware with the unique stream-buffer hardware used to improve memory bandwidth [5]. These early T3Es can crash if both these hardware subsystems attempt to access nearby memory locations at the same time. The vendor-supplied MPI either buffers each message or blocks execution, avoiding all dangerous memory accesses. More modern T3Es, such as the one at the CEWES MSRC, do not have this hardware limitation and do not require buffering for safety. Communication-bandwidth experiments described in [6] imply that the vendor-supplied MPI has not been modified for the improved hardware.

B. IBM PE on the IBM SP

The IBM SP is a multicomputer: each node has its own system image and local memory [7]. By default, the vendor-supplied MPI implementation on the SP supports synchronization overlap only for messages up to tens of kilobytes. The support for overlap ends at the "eager limit," which is defined by the environment [8].

Figure 5 models how the vendor-supplied MPI implements synchronization overlap through "eager" communication. Each process allocates a buffer in local memory for messages from each of the other processes. The MPI_Isend writes the message to the buffer space assigned to the sending process in the local memory of the receiving process. The MPI_Recv then copies the message from this buffer.

[Fig. 5. Synchronization overlap on the IBM SP using eager communication.]

Figure 6 models how the vendor-supplied MPI implements message passing for messages larger than the eager limit, in the default environment. The MPI_Isend contributes little to actual data transfer, and the MPI_Recv simply blocks until the MPI_Wait executes. In other words, the processes must synchronize.

[Fig. 6. No overlap on the IBM SP for messages larger than the eager limit.]

One method of increasing the range of overlap is simply increasing the eager limit using the MP_EAGER_LIMIT environment variable [8]. Increasing the eager limit is not a scalable solution, however. To receive eager sends, each process must have separate buffer space for every other process. Therefore, the total buffer space grows as the square of the number of processes. Even if more than enough buffer space is available, the IBM Parallel Environment imposes a hard limit of 64 kB on the eager limit [8].

A more scalable method of increasing the range of overlap is to allow the SP's communication subsystem to interrupt execution [9]. If the environment variable MP_CSS_INTERRUPT is set to yes, the communication subsystem will interrupt computation to send or receive messages. The default setting is no. Figure 7 models how the vendor-supplied MPI implements synchronization overlap when interruption is allowed. The MPI_Recv causes the communication subsystem to interrupt the sending process, and a communication thread becomes active and completes the send. Computation resumes once the message is sent.

[Fig. 7. Synchronization overlap on the IBM SP with MP_CSS_INTERRUPT set to yes.]

C. MPT for IRIX on the SGI Origin2000

The SGI Origin2000 is a multiprocessor with one system image and physically distributed, logically shared memory [10]. By default, the vendor-supplied MPI implementation on the CEWES MSRC Origin supports synchronization overlap for messages up to about a megabyte. Other Origins may have different defaults.
Within a single Origin system, all MPI messages move through shared buffers [4]. Figure 8 models how the vendor-supplied MPI implements synchronization overlap when the message fits within the shared buffer.

[Fig. 8. Synchronization overlap on the SGI Origin2000 for messages fitting within the shared buffer.]

[Fig. 9. Incomplete synchronization overlap on the SGI Origin2000 for messages larger than the shared buffer.]

The MPI_Isend copies the message to the shared buffer, and the MPI_Recv copies it back out. Figure 9 models how the vendor-supplied MPI implements message passing when the message does not fit within the shared buffer. The MPI_Recv does not complete until the MPI_Wait transfers the remaining blocks of the message through the shared buffer. As a side effect, the two processes must synchronize.

A straightforward method for increasing the range of overlap is simply increasing the size of the shared buffers using the MPI_BUFS_PER_PROC environment variable, where the value of the variable indicates the number of memory pages [4]. Unlike eager-buffer space on the IBM SP, the total shared-buffer space on the Origin varies with the number of processes, not with the square of that number. Therefore, this method of increasing the range of overlap does not suffer from such poor scalability.

D. Summary

Each of the tested implementations supports the overlap of computation and synchronization over a wide range of message sizes, either by default or with minor changes to the environment. Therefore, all the implementations should support the efficient execution of asynchronous applications. The experiment itself is asynchronous to an extreme. The benefits, if any, of synchronization overlap for synchronous applications are not so obvious.

III. Overlap of Computation with Data Transfer

MPI implementations that support synchronization overlap may or may not support data-transfer overlap. Support for synchronization overlap can fall into three categories according to the type of data transfer: two-sided, one-sided, and third-party. An MPI implementation may support synchronization overlap without any support for data-transfer overlap. This corresponds to two-sided data transfer, where one process must interrupt the execution of another process for transfer to occur. This interruption can be avoided in implementations that support one-sided data transfer. In this case, only one of the two processes sharing a message must be involved in data transfer at a time. The other process is free to continue computing, resulting in a limited form of data-transfer overlap. In contrast, third-party data transfer allows full data-transfer overlap. The computing processes operate concurrently with a third party that handles the communication.

The experiment described in the previous section does not differentiate between these three possible implementations of synchronization overlap or the corresponding levels of data-transfer overlap. The experiment does not directly measure the time for data transfer. Even if it did, this time would often be orders of magnitude smaller than the computation time. To differentiate between implementations of synchronization overlap and to test support for data-transfer overlap, we employ a larger experiment based on a semiconductor device simulation. This application is described in detail in [11]. It uses an unstructured two-dimensional grid that is partitioned among parallel processes, and it relies on a Krylov subspace iterative solver. The runtime is dominated by the many sparse matrix-vector multiplications executed by the solver. Each parallel multiplication requires the communication of values at the interfaces of the distributed grid partitions.
With carefully chosen partitions, the computation and communication are both load balanced, resulting in a synchronous application. The experiment uses two versions of this application that differ only in the implementation of the parallel matrix-vector multiplication. One version has separate phases of communication and computation, and the other tries to overlap communication with computation.

Figure 10 presents a schematic of the version with separate phases of communication and computation, the nonoverlap version. Each process first computes its share of the matrix-vector multiplication. It then issues MPI_Irecvs and MPI_Isends to share partition-interface values, followed immediately by an MPI_Waitall that blocks until communication completes.

Figure 11 presents a schematic of the version of parallel multiplication designed to overlap communication with computation. The computation is split into two pieces. Each processor first calculates the multiplication results for the points at the interfaces of its partition of the grid. The bulk of the computation, the interior of the partition, occurs after the MPI_Irecvs and MPI_Isends but before the MPI_Waitall. The intent of this design is to allow communication to occur concurrently with the interior computation.
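The paper does not list the application code; the following is a hedged sketch in C of the overlap version of the multiplication, assuming hypothetical helper routines (multiply_boundary_rows, multiply_interior_rows, pack_interface_values) for the partition-local work and equal per-neighbor message counts for simplicity.

    #include <stdlib.h>
    #include <mpi.h>

    /* Hypothetical application routines; names are illustrative only. */
    void multiply_boundary_rows(void);
    void multiply_interior_rows(void);
    void pack_interface_values(double *sendbuf);

    /* Overlap version: interior computation sits between the communication
       start (MPI_Irecv/MPI_Isend) and its completion (MPI_Waitall). */
    void matvec_overlap(int nneighbors, const int *neighbor, int count,
                        double *sendbuf, double *recvbuf, MPI_Comm comm)
    {
        MPI_Request *requests = malloc(2 * nneighbors * sizeof(MPI_Request));
        MPI_Status  *statuses = malloc(2 * nneighbors * sizeof(MPI_Status));
        int i, nreq = 0;

        multiply_boundary_rows();        /* interface results needed by neighbors */
        pack_interface_values(sendbuf);

        for (i = 0; i < nneighbors; i++) {
            MPI_Irecv(&recvbuf[i * count], count, MPI_DOUBLE, neighbor[i], 0,
                      comm, &requests[nreq++]);
            MPI_Isend(&sendbuf[i * count], count, MPI_DOUBLE, neighbor[i], 0,
                      comm, &requests[nreq++]);
        }

        multiply_interior_rows();        /* may overlap with the transfers */

        MPI_Waitall(nreq, requests, statuses);
        /* received interface values would be combined into the local result here */

        free(requests);
        free(statuses);
    }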

[Fig. 10. Parallel matrix-vector multiplication with separate phases of communication and computation.]

[Fig. 11. Parallel matrix-vector multiplication programmed explicitly to overlap communication with computation.]

A. Expected results

Common wisdom suggests that the overlap version of parallel sparse matrix-vector multiplication should provide better performance [2]. The caveat in [2] of "suitable hardware" should not be underestimated, however. The degree to which any performance improvement is realized depends on the level of overlap supported by the MPI implementation and the underlying system. All the implementations tested here support synchronization overlap, but they differ in their specific implementation of this overlap.

Consider what might happen with the parallel multiplication experiment for synchronization overlap that uses two-sided communication. This case corresponds to the MPI implementation on the IBM SP for messages larger than the eager limit when MP_CSS_INTERRUPT is set to yes. In both versions of the application, no data-transfer overlap can occur for any one message. The runtime is at least the sum of computation time and maximum per-process data-transfer time. Any advantage of the overlap version is a higher-order effect related to inherent asynchrony of the application: load imbalance in the communication pattern may cause synchronization overhead that the overlap version can avoid. For a primarily synchronous application such as this, the performance improvement should be minimal.

A similar result is likely for synchronization overlap that uses one-sided communication. Though one-sided communication can provide partial data-transfer overlap, the synchronous nature of the application avoids overlapping data transfer with actual computation. In both versions of the application, all the processes communicate at roughly the same time. Therefore, each one-sided data transfer overlaps with other one-sided data transfers, not with computation. Again, any advantage of the overlap version is a higher-order effect, and performance improvement should be minimal. All the tested MPI implementations correspond to this one-sided case for some range of message sizes: under the eager limit on the SP, smaller than the shared buffers on the Origin, and smaller than available memory on the T3E.

For no message sizes, however, do any of the tested implementations correspond to the remaining category of synchronization overlap, where data transfer is handled by a third party. Yet it is just this category that offers the greatest promise for the overlap version of the application. In the nonoverlap version, the processes block at the MPI_Waitall while the third party completes the communication. Conversely, the overlap version performs useful computation while the third party transfers data. If the computation and communication take comparable time, the overlap version can take as little as half the time of the nonoverlap version.
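To summarize the argument above as a rough cost model (notation introduced here, not in the original paper), let $T_{\text{comp}}$ be the per-multiplication computation time and $T_{\text{comm}}$ the per-process data-transfer time. Then, approximately,

    \[
    T_{\text{two-sided or one-sided}} \approx T_{\text{comp}} + T_{\text{comm}},
    \qquad
    T_{\text{third-party}} \approx \max(T_{\text{comp}}, T_{\text{comm}}).
    \]

When $T_{\text{comp}} \approx T_{\text{comm}}$, the third-party case takes about half the time of the other cases, which is the source of the 50% peak improvement discussed below.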
B. Actual results

Table I presents the results for the performance improvement of the overlap version of the test application over the nonoverlap version. The results are for a specific data set of about 10 000 grid points run on 16 processors. Other problem and system sizes found in [11] show similar results. The performance improvement is calculated as follows, where $T_{\text{overlap}}$ and $T_{\text{nonoverlap}}$ are the execution times for the overlap version and nonoverlap version, respectively:

    \[
    \text{Improvement} = \frac{T_{\text{nonoverlap}} - T_{\text{overlap}}}{T_{\text{nonoverlap}}} \times 100\%.
    \]

As predicted, the overlap version shows little improvement over the nonoverlap version on the SP and the Origin. On the T3E, however, the overlap version runs dramatically faster, contradicting the prediction. More surprisingly, the improvement of almost 80% is significantly higher than the theoretical peak improvement of 50% given by perfect overlap!
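As an arithmetic check on this formula (a worked example added here, using the MPT 1.2.1.1 times reported in Table II below):

    \[
    \frac{55.61 - 14.85}{55.61} \times 100\% \approx 73.3\%,
    \]

which matches the improvement listed in the table.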

TABLE I
Performance Improvement of Programming for Overlap

  MPI implementation                    Improvement
  IBM PE on IBM SP                      0.4%
  MPT for IRIX on SGI Origin2000        5.3%
  Cray MPT 1.2.0.1 on CRAY T3E          78.6%

TABLE II
Comparison of MPI Implementations on the CRAY T3E

  Implementation    T_nonoverlap    T_overlap    Improvement
  MPT 1.2.1.1       55.61           14.85        73.3%
  MPT 1.2.1.2       15.32           14.59        4.1%
  MPICH             12.48           12.54        -0.4%

  Times are in seconds.

Table II presents newer results for the T3E that include different implementations of MPI: MPT 1.2.1.1, MPT 1.2.1.2, and the MPICH implementation available from Mississippi State University [12]. Unlike MPT 1.2.1.1, MPT 1.2.1.2 validates the prediction by almost eliminating the advantage of the overlap version. MPICH goes a step further by completely eliminating this advantage, in addition to providing better absolute performance. The dramatic "improvement" shown by MPT 1.2.0.1 and MPT 1.2.1.1 appears to be a result of a bug that has been removed from MPT 1.2.1.2.

IV. Conclusions

MPI implementations provide different levels of support for the overlap of computation with communication. The implementations tested here all support the overlap of computation with synchronization, a capability particularly useful for asynchronous parallel applications. In contrast, these MPI implementations provide only limited support for the overlap of computation with data transfer.

This limitation has important implications for applications programmed explicitly for overlap. For synchronous applications, "programming for overlap" actually implies a specific level of overlap: full data-transfer overlap. This level is only provided by implementations using third-party communication, not two-sided or even one-sided communication. The parallel systems and MPI implementations tested here are common targets for large-scale parallel computations. None of them provide third-party communication, and none show significant performance improvement for the overlap experiment modeled in figures 10 and 11. This negative result shows that programming explicitly for overlap is clearly not a portable performance enhancement for the synchronous application used in the experiment. According to the same arguments used to predict this result, programming for overlap may be of little benefit for synchronous applications in general. Many current MPI implementations, including all those tested here, do not support the required level of overlap. The additional development effort and code complexity required to program for overlap seem unjustified.

Acknowledgments

The authors are grateful to Purushotham Bangalore and Shane Hebert of Mississippi State University for providing insight into the mechanics of MPI implementations. This research was sponsored in part by the DoD High Performance Computing Modernization Program CEWES Major Shared Resource Center through Programming Environment and Training. Views, opinions, and/or findings contained in this report are those of the authors and should not be construed as an official Department of Defense position, policy, or decision unless so designated by other official documentation.

References

[1] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, The MIT Press, 1994.
[2] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, University of Tennessee, Knoxville, 1995.
[3] Performance of the CRAY T3E Multiprocessor, Silicon Graphics, 1999.
[4] Message Passing Toolkit: MPI Programmer's Manual, Cray Research, 1998.
[5] CRAY T3E Programming with Coherent Memory Streams, Version 1.2, Cray Research, 1996.
[6] J. White III, Using the CRAY T3E at OSC, Ohio Supercomputer Center, 1998.
[7] H. Kitzhofer, A. Dunshea, and F. Mogus, RS/6000 SP Performance Tuning, IBM, 1998.
[8] IBM Parallel Environment for AIX: MPI Programming and Subroutine Reference, Version 2, Release 3, IBM, 1997.
[9] IBM Parallel Environment for AIX: Operation and Use, Volume 1: Using the Parallel Operating Environment, IBM, 1997.
[10] D. Cortesi, Origin2000 and Onyx2 Performance Tuning and Optimization Guide, Silicon Graphics, 1998.
[11] S. Bova and G. Carey, "A Distributed Memory Parallel Element-by-Element Scheme for Semiconductor Device Simulation," Computer Methods in Applied Mechanics and Engineering, in press.
[12] L. Hebert, W. Seefeld, and A. Skjellum, MPICH on the CRAY T3E, PET Technical Report 98-31, CEWES MSRC, 1998.