Low Latency MPI for Meiko CS/2 and ATM Clusters

Chris R. Jones, Ambuj K. Singh, Divyakant Agrawal
Department of Computer Science
University of California, Santa Barbara
Santa Barbara, CA

This research was supported by the NSF under CCR, CDA, and CCR. Work supported by LANL under UC94-B-A-223 and by a research gift from NEC Japan.

Abstract

MPI (Message Passing Interface) is a proposed message passing standard for the development of efficient and portable parallel programs. An implementation of MPI is presented and evaluated for the Meiko CS/2, a 64-node parallel computer, and a network of 8 SGI workstations connected by an ATM switch and Ethernet.

1. Introduction

A major hurdle of the message passing paradigm has been the lack of a standard library supporting features such as point-to-point and collective communication, tagged message delivery, and synchronization primitives. This motivated a consortium of researchers and practitioners in the parallel computing arena to develop a standard for message passing, the Message Passing Interface (MPI) [12]. Rather than adopt one of the existing message passing libraries, they chose to create a standard with all of the above features by integrating the features provided by existing message passing libraries such as PVM [8], NX/2 [9], p4 [6] and PARMACS [7].

Implementations of high-level message passing libraries such as MPI are often significantly less efficient than lower-level libraries because of their support of high-level programming features and their failure to exploit specific architectural details. In implementing message passing libraries, implementors often have to trade off between low latency for small messages and high bandwidth for large ones. This paper examines the necessary overheads in implementing the MPI library on top of existing user-level libraries for two platforms: a 64-node Meiko CS-2, and a cluster of Silicon Graphics workstations connected by both Ethernet and an ATM (Asynchronous Transfer Mode) switch. The focus of the research is on reducing latencies in a tagged message passing model of MPI.

In MPI, a sending process issues a send operation, which eventually transfers data to a process that issues a matching receive operation. Each process might issue several sends or several receives, so the processes are responsible for matching corresponding send and receive operations. A message communication in MPI therefore involves each send operation sending its envelope to the receiver, the matching of these envelopes to receive tags, and the eventual transfer of data from the sender's buffer to the receiver's buffer. This paper examines methods for the sending of envelopes, the matching of sends and receives, and the sending of data. To minimize latency, a mechanism is proposed which overlaps the transfer of data and send envelopes, buffering data temporarily when necessary at the receiver. However, for large amounts of data, the temporary buffering becomes a major overhead and can limit bandwidth. So, we propose a hybrid implementation in which, above a certain message size threshold, the sending and matching of envelopes occurs first, and then a DMA (direct memory access) from the sender to the receiver is initiated without intermediate buffering.

2. Related Work

MPICH [2] is an implementation of MPI designed for portability that provides a well-defined device layer allowing for easy implementation on new architectures.
It has been implemented for a variety of architectures through the widely portable p4 [6] communications layer, and also includes specialized device layers for many parallel machines, including the Meiko CS-2. xmpi [15] extends MPICH by providing a device driver on top of the x-kernel, a framework for implementing network protocols in the kernel more efficiently than existing frameworks such as STREAMS. As the x-kernel can be used to implement kernel-level network protocols, it could provide more efficient overall implementations than other kernel-level protocols, such as those described in this paper. The authors propose an MPI implementation over ATM AAL5.

The Meiko CS-2 has specialized hardware for providing secure user-level communication using a high speed network [13]. The Meiko implementation of MPI included with the MPICH distribution is based on a tagged message passing widget provided by Meiko, called the tport widget. The tport widget, however, trades off latency in providing high bandwidth communication, and provides no support for collective communications, requiring that MPICH implement collective communications on top of point-to-point mechanisms.

LAM [5] is a parallel software environment that implements MPI on top of its own existing features. Those MPI features which have no close match are independently implemented. Like MPI, it provides a rich environment for parallel programming. The MPI extension allows users to use LAM features as well.

Efficient implementation of the MPI collective communications library on top of Ethernet is examined by Bruck, Dolev, Ho, Rosu and Strong [3]. A user-level reliable transport protocol is given which uses the broadcast nature of the Ethernet for efficiency. Like the broadcast mechanism discussed in this paper, the exploitation of hardware broadcast gives a more efficient implementation than would be possible using only point-to-point communication.

MPI's message delivery guarantees can be unrealizable due to limited resources for message envelopes. Burns and Daoud [4] discuss tactics for overflow detection and reporting in MPI implementations.

Basu, Buch, Vogels and von Eicken [1], as well as Thekkath, Nguyen, Moy and Lazowska [17], discuss the inefficiency of implementing networking protocols in the kernel. They discuss methods for moving parts of the networking protocol into the user level, leaving only the necessary security mechanisms in the kernel. Both papers give implementations of user-level networking protocols for ATM, showing significant improvements over kernel-level implementations. An implementation of a minimal-latency DMA mechanism for ATM is given by Thekkath, Levy and Lazowska [16]. The DMA mechanism is the heart of communication for the Meiko, so a DMA mechanism such as this could be used in conjunction with the Meiko implementation discussed in this paper for a high performance ATM implementation.

3. MPI Standard

MPI [12] is a communication standard to support the development of portable parallel programs. The formal MPI specifications define the following primitives: point-to-point communication, collective communication, process group management and virtual topology management, environmental management, and a profiling interface. In this paper, we restrict our attention primarily to the point-to-point communication primitives:

    MPI_Send(buffer, count, datatype, dest, tag, communicator)
    MPI_Recv(buffer, count, datatype, source, tag, communicator, status)

There are several variants of the MPI_Send call: a buffered mode, synchronous mode, and ready mode, each of which has its own blocking and nonblocking variant. The buffered send, MPI_Bsend, ensures that the completion of the send operation does not depend on the posting of the receive operation. Once the buffered send has been posted, it will complete regardless of the activities of the receiver. The user is responsible for providing the buffer space for the buffered send. As long as this buffer space is monitored to make sure space is available, the send is guaranteed to be buffered correctly.
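
As a concrete illustration of these primitives, the following minimal C program (a generic MPI usage example, not code from the implementation described here) sends four doubles from rank 0 to rank 1 with a standard-mode send and a matching receive:

    /* Minimal MPI point-to-point example (illustrative only). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double buf[4] = {1.0, 2.0, 3.0, 4.0};
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* tag 99 must match the tag given by the receiver (or MPI_ANY_TAG) */
            MPI_Send(buf, 4, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, 4, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, &status);
            printf("rank 1 received %f ... %f\n", buf[0], buf[3]);
        }

        MPI_Finalize();
        return 0;
    }
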
The synchronous mode send, MPI_Ssend, adds the requirement that the sender must wait until the receiver has posted the matching receive, and the send and receive have completed, with the sender's data having been successfully transferred to the receiver's buffer. This allows the sender to know the exact point at which the data transfer occurred. The ready mode send, MPI_Rsend, allows the programmer to give more information to MPI, by informing it that the sender knows that the receiver has already posted a receive.

As far as receive operations go, there are no special modes; all receives match any type of send. However, there is an operation similar to MPI_Recv, MPI_Probe, which polls for incoming messages without actually completing the data transfer. A nonblocking version of this operation, MPI_Iprobe, checks whether a message is available without blocking. The MPI_Bcast call supplies a broadcast mechanism, used when a single process wishes to send data to all other processes in a group.

We have implemented MPI point-to-point and the broadcast collective communication primitive on the Meiko and the ATM cluster. The implementation of broadcast on the Meiko uses the underlying hardware broadcast mechanism, whereas on the ATM network it uses a succession of point-to-point messages.
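
As an illustration of the broadcast primitive, the following generic MPI fragment (again, not code from our implementation) has the root distribute a single integer to every process in the communicator:

    /* Illustrative use of MPI_Bcast: the root distributes a parameter. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, n = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            n = 1024;             /* value known only to the root initially */

        /* every process names the same root (0); afterwards all ranks see n == 1024 */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d has n = %d\n", rank, n);
        MPI_Finalize();
        return 0;
    }
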

4. Implementation of MPI on Meiko

4.1. Design Motivations

As discussed above, MPI point-to-point communication involves transmitting a send envelope, matching a send with a receive, and transferring data from the sender's buffer to the receiver's buffer. There are several choices as to how and when these operations are performed. We discuss some of these options next.

The ability of a receive operation to specify MPI_ANY_SOURCE requires that the matching be done at the receiver. Waiting to transfer the data until a match is made results in high latencies, especially for small messages. This is because the send's envelope must first be sent to the receiver, a match must be made at the receiver, a second trip across the network is necessary to request the data from the sender, and finally another trip is necessary for the sender to actually send the DMA data. If the extra latency is avoided by initiating a transfer simultaneously with a match operation, then temporary buffers will be needed, thus increasing the space requirements. We use a hybrid approach in which the transfer of small messages is overlapped with the matching operation, and the transfer of large messages occurs after the match operation. The crossover point is determined by examining the various latencies.

As far as the matching operation at the receiver goes, we still need to decide whether to use the SPARC (the main 40 MHz processor on a Meiko node) or the Elan (10 MHz communications co-processor) to perform the matching. If we are to match sends and receives only with the SPARC processor, data transfers will not complete until the application program issues an MPI_Wait or MPI_Test operation on a nonblocking operation. For instance, if the receiver has issued a nonblocking operation, and the sender issues a synchronous send, the sender will have to wait for the receiver to issue a completion operation on its receive. To handle this problem, we could utilize the Elan co-processor for matching of sends and receives in the background. However, the slower Elan may not be able to handle the somewhat intensive message matching as quickly as the faster SPARC, so latency could increase. Additionally, since the SPARC must eventually be informed of the completion of a receive, the extra synchronization between the Elan and the SPARC can also increase latency. The existing ANL/MSU MPICH implementation [2] on top of the tport widget uses the Elan for matching in the background. For the sake of comparison, we implemented our matching of sends and receives on the SPARC.

Regardless of matching at the receiver, we would also like the sending processor to be able to issue nonblocking sends very quickly, freeing the processor for other responsibilities. So, we use the Elan to perform this sending in the background, allowing the SPARC to perform other computation. The actual transfer of data from the sender requires allocation of memory at the receiver. The expense of these mechanisms has been researched in detail in an implementation of Active Messages for the Meiko CS-2 [14]. In order to minimize latency, we allocate space for a single send envelope for each sending processor at each receiver.

Figure 1. Meiko transfer mechanisms
Figure 2. Meiko round-trip latency

4.2. Meiko results

As mentioned before, data can be optimistically transferred before the match to a buffer at the receiver, or after the match is made directly to the receiver's address space. The comparison of these two mechanisms is shown in detail in Figure 1. The intersection of the two lines occurs at a message length of 180 bytes. Based on this we choose 180 bytes as the crossover point between the two implementations.
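
The resulting send path can be sketched schematically as follows; the type and function names are illustrative placeholders, not the actual routines of our implementation:

    /* Schematic sketch of the hybrid send path described above. */
    #define EAGER_THRESHOLD 180   /* measured crossover point on the Meiko, in bytes */

    typedef struct { int source, tag, context, nbytes; } envelope_t;            /* hypothetical */
    void send_eager(int dest, const envelope_t *env, const void *buf, int n);      /* assumed */
    void send_rendezvous(int dest, const envelope_t *env, const void *buf, int n); /* assumed */

    void internal_send(int dest, const envelope_t *env, const void *buf, int nbytes)
    {
        if (nbytes <= EAGER_THRESHOLD) {
            /* Small message: ship envelope and data together.  If the matching
             * receive has not been posted, the receiver buffers the data
             * temporarily, so no extra network round trip is required. */
            send_eager(dest, env, buf, nbytes);
        } else {
            /* Large message: send only the envelope.  Once the receiver has
             * matched it against a posted receive, the data is moved by DMA
             * directly into the user's buffer, with no intermediate copy. */
            send_rendezvous(dest, env, buf, nbytes);
        }
    }
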
The mechanism for small transfers allows much lower latencies than the tport mechanism for certain applications. Figure 2 shows the round-trip latency times for varying transfer sizes. Three plots are shown: our MPI implementation where matching is done entirely on the SPARC, the MPICH tport-based MPI implementation which does matching mostly on the Elan, and a plot of Meiko's tport widget without any MPI overheads. As Meiko's tport widget provides simplified tagged message passing directly, it is the lowest latency mechanism, with a 1-byte round-trip latency of 52 µs. The two MPI implementations add significant overheads to this minimal cost by adding the MPI features of communicators, datatypes and different modes of communication.

As MPICH implements MPI directly on top of the tport widget, it can be seen that MPICH adds 158 µs to the 1-byte round-trip latency. Our lower latency SPARC matching implementation decreases these overheads by providing an implementation directly over Meiko DMAs and transactions, decreasing latency significantly with a total round-trip time of 104 µs, about 52 µs higher than the tport widget. The bend in the curve for the SPARC matching MPI implementation, visible around 180 bytes, clearly shows where the high bandwidth mechanism takes over and DMAs are invoked by the receiver after the matching of the send and receive has taken place.

To demonstrate that this implementation still maintains high bandwidth for large transfers, Figure 3 plots bandwidth versus message size for larger messages with the two MPI implementations and the Meiko tport. As seen in the figure, the best possible DMA-provided bandwidth of 39 MBytes/s is nearly reached, and bandwidth is in fact increased as a result of decreasing latency for the SPARC matching implementation.

Figure 3. Meiko bandwidth

5. ATM Implementation

The ATM cluster examined here consists of eight SGI Indy 133-MHz workstations and an SGI Challenge SMP 150-MHz dual-processor. All of these machines have 64 MBytes of RAM, and are connected via a 10 Mbit/s Ethernet and 155 Mbit/s ATM channels. The ATM switch is a Fore Systems ForeRunner ASX-200 with eight 155 Mbit/s ports. Each SGI is connected to the switch by a Fore GIA-200 interface card. These interface cards include an Intel i960 processor dedicated to segmentation and reassembly for the AAL3/4 and AAL5 protocols without using the main processor.

Four different user-level protocols are available to run in this environment. TCP/IP is available with two different signaling protocols to establish the connection. The first, Classical IP, is the standard signaling protocol defined by the ATM Forum. The second, SPANS, is Fore's own signaling protocol. The differences in these protocols only affect connection establishment. In the implementations we will examine, connections are static, so connection setup time is not of major importance. Hence, this paper only examines Classical IP. Fore also offers an API which provides communication on top of ATM adaptation layers 3 and 4 (which are treated identically), and AAL5.

As our goal is a low latency implementation of MPI, it is worth considering the latency of a packet over these different protocols. We would expect that the Fore adaptation layers might be significantly faster, since they have very few overheads. Unfortunately, they are not significantly faster than the Fore TCP or UDP implementations. This is because of the overheads involved in the STREAMS protocol layers used by the Fore API. Figure 4 shows a comparison of the latencies for Fore's implementation of AAL4, TCP and UDP. Except for small message sizes, the latencies of these protocols are indistinguishable from each other. This prompted us to confine our attention only to TCP/IP and UDP as the underlying communication protocols. Our measurements were made using TCP/IP and UDP on an Ethernet as well as the described ATM network.

Figure 4. ATM round-trip latency
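
Round-trip latencies of the kind reported in Figures 2, 4 and 5 are typically measured with a simple ping-pong loop between two processes; the following sketch is a generic example of such a microbenchmark, not our measurement code:

    /* Illustrative ping-pong round-trip latency microbenchmark. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        enum { REPS = 1000, NBYTES = 1 };
        char buf[NBYTES];
        int rank, i;
        double t0, t1;
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf[0] = 0;

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {
                MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("average round-trip time: %f us\n", (t1 - t0) / REPS * 1e6);

        MPI_Finalize();
        return 0;
    }
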
5.1. TCP Implementation

The facilities for communication provided by TCP/IP and the Meiko are quite different. While the Meiko provides mechanisms for manipulation of remote data through DMAs or remote transactions, TCP/IP provides a much different communication mechanism: a reliable stream of data between two processes on opposite ends of a channel. However, many of the issues we confronted during the implementation of MPI for the Meiko are common to both, such as the need to match MPI_Send operations to MPI_Recv operations. In order to reuse the components of our MPI implementation on the Meiko, we decided to implement the underlying primitives (assumed in the Meiko implementation) on top of TCP.
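
One hypothetical way to express this portability layer is as a small set of C prototypes, sketched below; the names and signatures are illustrative only, and the primitives they stand for are described next:

    /* Hypothetical portability layer assumed by the Meiko implementation and
     * re-implemented here over TCP sockets.  Names and signatures are
     * illustrative, not the actual interface. */
    typedef struct mpi_envelope mpi_envelope_t;   /* tag, communicator, source, size */

    int  send_envelope(int dest, const mpi_envelope_t *env);
    int  send_envelope_with_data(int dest, const mpi_envelope_t *env,
                                 const void *data, int nbytes);   /* piggybacked payload */
    int  send_dma_data(int dest, const void *data, int nbytes);   /* bulk transfer after match */
    void set_remote_event(int dest, int event_id);                /* signal completion remotely */
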

Essentially, we needed to implement a method of sending an envelope, a method of sending an envelope with piggybacked data, and a method for setting remote events and sending DMA data. As the communication latencies are quite large when TCP is used, piggybacking data is more important than in the Meiko implementation. As such, the buffer flow control mechanism used in the Meiko implementation is inappropriate, since it assumes only a single outstanding message at any given time. A window protocol allowing multiple outstanding messages would be ideal. Unfortunately, since not all MPI messages can be ordered on the same FIFO because of tags and communicators, standard window protocols are inappropriate. So, we have the receiver keep a reserved amount of memory for each sender, to which the sender sends data optimistically. Once the space is freed, the receiver informs the sender that it can be reused. This allows the sender to optimistically send many messages as long as it knows that free space is available at the receiver.
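
The sketch below illustrates one way such a credit-based envelope send might look in C; the header mirrors the 25 bytes of control information discussed with the results below, but the layout, field names and helper function are assumptions for illustration, not the actual wire format:

    /* Sketch of the per-sender credit scheme described above. */
    #include <stddef.h>
    #include <stdint.h>

    enum { MSG_ENVELOPE = 1, MSG_DMA_REQUEST = 2, MSG_DMA_DATA = 3 };  /* hypothetical codes */

    struct ctrl_header {            /* roughly the 25 bytes of MPI protocol information */
        uint8_t  type;              /* 1 byte: envelope, DMA request, ... */
        uint32_t credit_returned;   /* 4 bytes: receiver buffer space freed back to us */
        uint8_t  envelope[20];      /* 20 bytes: tag, communicator, source, DMA request info */
    };

    int write_fully(int fd, const void *buf, size_t n);   /* assumed helper: loops over write(2) */

    /* Send an envelope with an optional piggybacked payload.  'credit' tracks how
     * much of the receiver's reserved buffer space is still free; it is charged
     * here and replenished whenever the receiver reports freed space in the
     * credit_returned field of a message travelling the other way. */
    int send_envelope(int sock, struct ctrl_header *hdr,
                      const void *data, size_t nbytes, size_t *credit)
    {
        if (nbytes > *credit)
            return -1;                           /* out of credit: caller must wait */
        if (write_fully(sock, hdr, sizeof *hdr) != 0 ||
            write_fully(sock, data, nbytes) != 0)
            return -1;
        *credit -= nbytes;                       /* optimistic send consumed receiver space */
        return 0;
    }
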

5.2. TCP Results

Figure 5 displays the round-trip latency times for messages with both TCP and MPI over TCP. Four plots are shown: two for TCP and MPI over Ethernet, and two for ATM. There is approximately a 150 µs higher round-trip latency for the MPI implementations over TCP. This overhead is caused by the additional transfer of envelopes and control information, and by the cost of performing message matching.

Figure 5. TCP round-trip latency

Table 1 displays the breakdown of these overheads for a 1 byte message. The largest overhead is the 925 µs cost of a round-trip message over Ethernet, and 1065 µs for ATM. The next line in the table is the cost of sending 25 bytes of MPI protocol information. (Of the 25 bytes, 1 byte designates the type of message, such as envelope or DMA. 4 bytes are included for telling the destination how much reserved space has been freed. The last 20 bytes are used for the envelope and DMA request information.) The amount of this overhead is 45 µs on Ethernet and about 5 µs on ATM. The next line measures the overhead of determining the incoming message type. This was measured to be 65 µs on Ethernet and slightly higher (85 µs) on ATM. The next line measures the overhead of receiving the actual envelope once the message type has been determined. Once more, the costs are 65 µs on Ethernet and 85 µs on ATM. These costs are so high because the associated operations need to cross the kernel boundary. The last 35 µs overhead is the cost of actually performing MPI matching. (The cost for receiving the actual data is already included in the latency figures of the first line.)

Table 1. MPI round-trip overheads with TCP

    Overhead                     ATM        Ethernet
    1 byte round-trip latency    1065 µs    925 µs
    25 byte info overhead        5 µs       45 µs
    Read for msg type            85 µs      65 µs
    Read for envelope            85 µs      65 µs
    Overheads for matching       35 µs      35 µs

Figure 6 displays the bandwidth obtained using TCP as the communications layer.

Figure 6. TCP bandwidth

For the sake of completeness, we also implemented MPI using the UDP transport level interface. The UDP implementation is very similar to the one with TCP, with additional measures taken to make the UDP communication reliable. As a consequence of the overhead that arises to make the UDP connection reliable, our results indicated that the performance of the UDP implementation was very similar to that of TCP [10].

6. Applications

6.1. Linear Equation Solver

A linear equation solver for N variables has been implemented which solves the system with an initial phase of computation by the initiator, N phases of broadcasting and computation by all processes, and a final phase of result gathering by the initiator. As the only communication mechanism involved here is the broadcast, the MPI-based program uses the collective communication primitives, implemented using Meiko's hardware broadcast mechanism. The results for the Meiko are shown in Figure 7, which gives times for solving a linear system using from 1 to 32 processes. Two plots are shown: one for the MPICH implementation, which implements broadcast using point-to-point messages, and one for our implementation. We also implemented matrix multiplication; the performance results are similar to those of the linear equation solver.

Figure 7. Meiko Linear Equation Solver

6.2. Pairwise Interactions

Another problem which exploits the use of a parallel computer well is molecular dynamics, where we need to compute the interacting forces within a group of particles. As each particle interacts with every other particle in the group, O(n^2) interactions must be calculated. To parallelize this problem, each processor is in charge of calculating the interactions of P/N particles, where N is the number of processors. The processes communicate in P-1 phases, passing a partition of the particles around in a ring. Each process calculates the interactions between the particles it is permanently assigned and the partition of particles that it has in the current round. Messages are simply passed in a ring, requiring only point-to-point messages. To allow concurrent sending and receiving in the communication phase of each round, a nonblocking send is posted to the next processor in the ring, then a blocking receive is performed, followed by a wait operation to complete the send.

Figure 8 shows the results of running this program to find the forces acting on each of 24 particles, using up to 8 processes of the Meiko. As each processor has a fairly even load, the processes tend to interact at nearly the same time, so a lower latency communication mechanism is beneficial. The high latencies of TCP on the cluster of workstations make this problem scale well only for much larger problem sets. Figure 9 shows the results of finding the forces acting on 128 particles. The ATM shows a clear performance gain, primarily because there is no network contention and fairly large messages are used, exploiting ATM's higher bandwidth.

Figure 8. Meiko Particle Pairwise Interactions
Figure 9. TCP Particle Pairwise Interactions
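
The ring communication phase described above can be sketched with nonblocking MPI calls as follows; buffer management and the interaction computation are assumed helpers, not code from our application:

    /* Each process posts a nonblocking send of the partition it currently holds
     * to the next process in the ring, blocks on a receive from the previous
     * process, then completes the send with a wait. */
    #include <mpi.h>

    void compute_interactions(const double *mine, const double *other, int n);  /* assumed */

    void pairwise_phases(double *send_part, double *recv_part, int part_len,
                         const double *my_particles, int rank, int nprocs)
    {
        int next = (rank + 1) % nprocs;
        int prev = (rank + nprocs - 1) % nprocs;
        int phase;

        for (phase = 0; phase < nprocs - 1; phase++) {
            MPI_Request req;
            MPI_Status  status;

            /* post the send first so sending and receiving proceed concurrently */
            MPI_Isend(send_part, part_len, MPI_DOUBLE, next, phase,
                      MPI_COMM_WORLD, &req);
            MPI_Recv(recv_part, part_len, MPI_DOUBLE, prev, phase,
                     MPI_COMM_WORLD, &status);
            MPI_Wait(&req, &status);

            compute_interactions(my_particles, recv_part, part_len);

            /* the partition just received is the one forwarded in the next phase */
            { double *tmp = send_part; send_part = recv_part; recv_part = tmp; }
        }
    }
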

7. Conclusions

The MPI standard is intended to provide a widely portable message passing interface. Its semantics leave many efficiency issues to the implementor. We have shown three elements that can affect performance on different architectures: the use of a communications co-processor for background processing, the buffering of data at the receiver, and the overlapping of data transfer with message matching. A mechanism was proposed that takes these issues into account, giving efficient implementations on the Meiko CS-2 and on a cluster of workstations communicating with TCP or UDP over Ethernet or ATM.

References

[1] Basu, A., Buch, V., Vogels, W., and von Eicken, T., U-Net: A User-Level Network Interface for Parallel and Distributed Computing. Proceedings of the 15th ACM Symposium on Operating Systems Principles, December.
[2] Bridges, P., Doss, N., Gropp, W., Karrels, E., Lusk, E., and Skjellum, A., Users' Guide to MPICH: A Portable Implementation of MPI. Argonne National Laboratory, May.
[3] Bruck, J., Dolev, D., Ho, C., Rosu, M., and Strong, R., Efficient Message Passing Interface (MPI) for Parallel Computing on Clusters of Workstations. ACM Symposium on Parallel Algorithms and Architectures, July.
[4] Burns, G. and Daoud, R., Robust MPI Message Delivery with Guaranteed Resources. Proceedings of the MPI Developers Conference, June.
[5] Burns, G., Daoud, R., and Vaigl, J., LAM: An Open Cluster Environment for MPI. Ohio Supercomputing Center.
[6] Butler, R. and Lusk, E., User's Guide to the p4 Parallel Programming System. Technical Report ANL-92/17, Argonne National Laboratory.
[7] Calkin, R., Hempel, R., Hoppe, H., and Wypior, P., Portable Programming with the PARMACS Message-Passing Library. Parallel Computing, Special Issue on Message Passing Interfaces, 20 (April 1994).
[8] Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V., PVM 3.0 User's Guide and Reference Manual. Technical Report ORNL/TM, Oak Ridge National Laboratory, February 1993.
[9] iPSC/2 and iPSC/860 User's Guide. Intel Corporation, Order Number, April.
[10] Jones, Chris, Low Latency MPI for Meiko CS-2 and ATM Clusters. MS Thesis, Department of Computer Science, University of California at Santa Barbara, July.
[11] Lin, M., Du, D. H. C., Thomas, J. P., and McDonald, J. A., Distributed Network Computing over Local ATM Networks. IEEE Journal on Selected Areas in Communications, Vol. 13, No. 4, May.
[12] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Version 1.1, June.
[13] CS-2 Documentation Set. Meiko World Incorporated.
[14] Schauser, K. and Scheiman, C., Experience with Active Messages on the Meiko CS-2. 9th International Parallel Processing Symposium, Santa Barbara, CA, April.
[15] Singhai, A. and Campbell, R. H., xmpi: An Implementation over x-kernel for ATM Networks. Proceedings of the MPI Developers Conference, June.
[16] Thekkath, C. A., Levy, H. M., and Lazowska, E. D., Efficient Support for Multicomputing on ATM Networks. Technical Report TR, Department of Computer Science and Engineering, University of Washington, April.
[17] Thekkath, C. A., Nguyen, T. D., Moy, E., and Lazowska, E. D., Implementing Network Protocols at User Level. IEEE/ACM Transactions on Networking, Vol. 1, No. 5, October.


More information

Parallel Programming

Parallel Programming Parallel Programming for Multicore and Cluster Systems von Thomas Rauber, Gudula Rünger 1. Auflage Parallel Programming Rauber / Rünger schnell und portofrei erhältlich bei beck-shop.de DIE FACHBUCHHANDLUNG

More information

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Meenakshi Arunachalam Alok Choudhary Brad Rullman y ECE and CIS Link Hall Syracuse University Syracuse, NY 344 E-mail:

More information

The Design and Implementation of a MPI-Based Parallel File System

The Design and Implementation of a MPI-Based Parallel File System Proc. Natl. Sci. Counc. ROC(A) Vol. 23, No. 1, 1999. pp. 50-59 (Scientific Note) The Design and Implementation of a MPI-Based Parallel File System YUNG-YU TSAI, TE-CHING HSIEH, GUO-HUA LEE, AND MING-FENG

More information

Switch Configuration message sent 1 (1, 0, 1) 2

Switch Configuration message sent 1 (1, 0, 1) 2 UNIVESITY COLLEGE LONON EPATMENT OF COMPUTE SCIENCE COMP00: Networked Systems Problem Set istributed: nd November 08 NOT ASSESSE, model answers released: 9th November 08 Instructions: This problem set

More information

Networking interview questions

Networking interview questions Networking interview questions What is LAN? LAN is a computer network that spans a relatively small area. Most LANs are confined to a single building or group of buildings. However, one LAN can be connected

More information

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on

More information

NoWait RPC: Extending ONC RPC to a fully compatible Message Passing System

NoWait RPC: Extending ONC RPC to a fully compatible Message Passing System NoWait RPC: Extending ONC RPC to a fully compatible Message Passing System Thomas Hopfner Franz Fischer Georg Färber Laboratory for Process Control and Real Time Systems Prof. Dr. Ing. Georg Färber Technische

More information

Architecture or Parallel Computers CSC / ECE 506

Architecture or Parallel Computers CSC / ECE 506 Architecture or Parallel Computers CSC / ECE 506 Summer 2006 Scalable Programming Models 6/19/2006 Dr Steve Hunter Back to Basics Parallel Architecture = Computer Architecture + Communication Architecture

More information

RICE UNIVERSITY. High Performance MPI Libraries for Ethernet. Supratik Majumder

RICE UNIVERSITY. High Performance MPI Libraries for Ethernet. Supratik Majumder RICE UNIVERSITY High Performance MPI Libraries for Ethernet by Supratik Majumder A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree MASTER OF SCIENCE Approved, Thesis Committee:

More information

Understanding MPI on Cray XC30

Understanding MPI on Cray XC30 Understanding MPI on Cray XC30 MPICH3 and Cray MPT Cray MPI uses MPICH3 distribution from Argonne Provides a good, robust and feature rich MPI Cray provides enhancements on top of this: low level communication

More information

Homework 1. Question 1 - Layering. CSCI 1680 Computer Networks Fonseca

Homework 1. Question 1 - Layering. CSCI 1680 Computer Networks Fonseca CSCI 1680 Computer Networks Fonseca Homework 1 Due: 27 September 2012, 4pm Question 1 - Layering a. Why are networked systems layered? What are the advantages of layering? Are there any disadvantages?

More information