
MPI-FM: High Performance MPI on Workstation Clusters

Mario Lauria (visiting at time of writing)
Dipartimento di Informatica e Sistemistica
Universita di Napoli "Federico II"
via Claudio, Napoli, Italy
lauria@nadis.dis.unina.it

Andrew Chien
Department of Computer Science
University of Illinois at Urbana-Champaign
1304 W. Springfield Ave., Urbana, IL 61801, USA
achien@cs.uiuc.edu

Abstract

Despite the emergence of high speed LANs, the communication performance available to applications on workstation clusters still falls short of that available on MPPs. A new generation of efficient messaging layers is needed to take advantage of the hardware performance and to deliver it to the application level. Communication software is the key element in bridging the communication performance gap separating MPPs and workstation clusters. MPI-FM is a high performance implementation of MPI for networks of workstations connected with a Myrinet network, built on top of the Fast Messages (FM) library. Based on the FM version 1.1 released in Fall 1995, MPI-FM achieves a minimum one-way latency of 19 µs and a peak bandwidth of 17.3 MByte/s with common MPI send and receive function calls. A direct comparison using published performance figures shows that MPI-FM running on SPARCstation 20 workstations connected with a relatively inexpensive Myrinet network outperforms the MPI implementations available on the IBM SP2 and the Cray T3D, both in latency and in bandwidth, for messages up to 2 KByte in size.

We describe the critical performance issues found in building a high level messaging library (MPI) on top of a low level messaging layer (FM), and the design solutions we adopted for them. One such issue was the direct and efficient support of common operations like adding and removing a header. Another was the exchange of critical information between the layers, like the location of the destination buffer. These two optimizations are both shown to be necessary, and their combination sufficient to achieve the aforementioned level of performance. The performance contribution of each of these optimizations is examined in some detail. These results delineate a new design approach for low level communication layers, in which a closer integration with the upper layer and an appropriate balance of the communication pipeline stages are the key elements for high performance.

1 Introduction

Growing interest in the use of networks of workstations to perform high performance computation has been spurred by the remarkable growth of their computational performance. Moreover, when compared to their closest competitor (massively-parallel processors based on commodity processors), workstations have other advantages. They are relatively cheap, and their large sales volume attracts investments directed to their rapid improvement. Software is abundant, readily available, and has a large base of established users. The use of networked machines to do distributed computing is not new, and a number of communication libraries which use TCP/IP over Ethernet have been around for some time now (BSD sockets [14], PVM [21]). However, Ethernet and its associated networking protocols were not designed for high performance computing, and their limitations when used for this purpose severely restrict the range of applications that can be run and achieve good parallel performance. Communication protocols require a number of services to work, like timers, buffer management, process protection, and process notification. In traditional protocol implementations, these services are provided by the operating system. But the convenience of relying on the operating system is paid for in terms of additional copies between address spaces, and by the context switch occurring at each system call [7], both of which contribute to overhead and reduce performance. As a result, on a network of workstations, the observed communication latency over TCP/IP is on the order of one half to one millisecond. In the same time interval, a typical workstation RISC processor can execute tens to hundreds of thousands of instructions.

Workstation clusters have recently become attractive for high performance computation due to the introduction of new communication technologies with much improved performance. The fast Local Area Networks (LANs) available today (ATM [5], FDDI, Fibrechannel [1], Myrinet [3]) are, in terms of latency and bandwidth, comparable to the proprietary interconnects found on MPPs. However, without a radical change in the way communication protocols are implemented, applications will not be able to reap the benefits of this new technology. Existing protocol implementations have been shown to achieve only modest performance improvements when used on new network hardware [15]. Our solution is to build new communication software, designed from the start with the objectives of low latency and high bandwidth communication. In the context of the Fast Messages (FM) project [17], we selected the Myrinet network due to its performance, programmability and price/performance ratio. On this network we wrote the FM library, a highly optimized, low latency messaging layer providing a virtual interface for the hardware [17]. The lowest layer of the communication software often loses most of the raw performance, due to architectural barriers (I/O bus) and to the large difference in abstractions (hardware device vs. programming interface). So in this first part of the project the goal was to minimize the performance loss. All choices, including the decision as to which services to include in the interface, were driven by performance considerations. The design of FM addresses some of the critical issues found in building a low level messaging layer: division of labor between the host and the network coprocessor, management of the input/output (I/O) bus, and buffer management. Implemented entirely in user space, FM avoids the high overhead of system calls.

By providing a few key services (buffer management, reliable and in-order delivery), the FM programming interface allows for a leaner, more efficient implementation of the higher level communication layers. FM achieves a short message latency of only 14 µs and a peak bandwidth of 17.6 MB/s, with an Active Messages style interface. The first part of the research project is complete and is documented in [17], and only a few details of the FM interface will be presented here. The second phase of the project constitutes the object of this work, and will be described in greater detail.

After this first phase, the problem of optimizing the FM interface was tackled. For this purpose, another communication layer, the MPI library, was built on top of FM to close the gap toward the user level. Then the entire communication hierarchy was studied to gain a better understanding of the origins of the software overhead. Once identified, the major sources of overhead were removed by modifying the FM interface as required. As a result, only those services that were shown to be strictly indispensable in reducing the overhead, and that could be implemented without substantial performance degradation, made their way into FM. One of the changes we made to FM was adding a simple gather, to support common operations like adding and removing headers. Another was the inclusion of an upcall to allow the exchange of critical information between the layers, like the location of the destination buffer. Most importantly, while all these optimizations individually contributed to performance, only their simultaneous application exposed all their potential benefit, revealing the importance of balancing the messaging layer across all the processors involved (sender, receiver, network interfaces). The performance achieved with the final version of MPI-FM is 19 µs 0-byte latency and 17.3 MB/s bandwidth. When compared with MPI-F, the optimized MPI implementation for the IBM SP2, MPI-FM shows better performance for message sizes of 2 KB or less. We were unable to find any results for networks of workstations that provided comparable performance. A current implementation of MPI on TCP/IP is typically two orders of magnitude worse for latency, and one order for bandwidth, so the comparison is hardly interesting. To the best of our knowledge, MPI-FM is the fastest implementation of MPI available for workstations at the time of writing (Spring 1996).

The remainder of this work is organized as follows. In Section 2, some background information is given on the three basic components of MPI-FM, which are Myrinet, FM for Myrinet and MPI. In Section 3, some related work is examined. In Section 4, the details of the implementation of MPI on top of FM 1.1 are given. The performance measurements are presented in Section 5, along with the analysis of the results. Section 6 will conclude this work with a summary of the contributions, some final considerations, and some topics for future research.

2 Background

Workstations

The workstations used for the measurements are Sun SPARCstation 20/71's. Each has a 75 MHz SuperSPARC-II processor and a 1 MB second level cache (SuperCache), and is rated at 125.8 SPECint92 / 121.2 SPECfp92. The SPARCstation 20 has a Sun-4m architecture, whose main feature is a high performance cache-coherent memory bus (MBus).

In our tests, the large second level cache was particularly important, as it reduced the cost of copying data out of the DMA region (the region of pinned memory in which FM stores incoming messages). The I/O bus (SBus) on which the network interface resides is clocked at 25 MHz and has a peak transfer rate of around 45 MB/s. A major problem with the Sun workstation architecture is that its I/O bus is optimized for large message transfers. Using programmed I/O (i.e., having the CPU move one double-word at a time) instead of the interface's DMA reduces the peak transfer rate to 22 MB/s. In the present version of FM, this represents a major bottleneck for bandwidth and directly limits performance for long messages.

Network

Myrinet is a high speed LAN interconnect which uses bidirectional byte-wide copper links to achieve a physical bandwidth of nearly 80 MB/s in each direction [3]. It uses the interconnect technology developed for the Caltech Mosaic project [1]. A Myrinet network is composed of network interfaces connected to crossbar switches by point-to-point links. The full crossbar switches have four or eight ports, and can be interconnected in an arbitrary topology. They use wormhole routing, which allows the packets to be switched with a latency of only about half a microsecond. The network interface consists of a custom VLSI chip (the LANai), 128 KB of fast SRAM, differential line drivers/receivers for the link, and the SBus control logic. The LANai contains a link interface, a processor and three DMA engines (one each for the incoming channel, outgoing channel, and the SBus). The LANai processor is a rather slow CISC processor, and is clocked at the SBus speed (25 MHz in our machines). Host-LANai interaction is achieved by mapping the interface memory into the host address space. While the host can read/write the interface memory with load/store operations, it cannot start the DMAs. Single word accesses to the LANai memory are rather expensive because they cross the SBus. As in most systems, DMA transfers to/from the host must be performed through a pinned-down DMA buffer in the kernel address space.

FM for Myrinet

Table I lists the three FM primitives. FM differs from a pure message passing paradigm in that there are two asynchronous send operations but no corresponding receives. Rather, each message includes the name of a handler, which is a user-defined function that is invoked upon message arrival, and that will process the carried data as required. When invoked, the FM_extract() function processes the pending received messages, dequeuing them and executing their handlers. FM_send_4() is a specialized version for messages of no more than four words, optimized for latency. The FM interface is a generalization of the Active Message model [22], in that there are no restrictions on the communication operations that can be carried out by the handler. (The user is responsible for avoiding deadlock situations.) Reliable and in-order delivery guarantees can be expensive if implemented in the upper messaging layers [13]. Their cost can be decreased if built directly into the lower level layer, where there is an opportunity to take advantage of some useful features of the network. Through a careful design which exploits the characteristics of the Myrinet architecture, FM offers reliable and in-order guarantees with minimal performance degradation.

    Function                              Operation
    FM_send_4(dest,handler,i0,i1,i2,i3)   Send a four word message
    FM_send(dest,handler,buf,size)        Send a long message
    FM_extract()                          Process received messages

    Table I: The FM 1.1 communication primitives

FM is composed of two parts, the host program and the LANai control program. The division of labour between the host and the interface processor is critical for performance, because of the potential for parallelism in protocol processing but also for bottlenecks if the load is not properly balanced. The relative difference in the speeds of the two processors, and the modest performance of the LANai in absolute terms, suggest assigning most of the work to the host, and keeping the LANai control program as simple as possible. To send a packet, the host processor writes the data directly into the LANai memory, one double-word at a time (programmed I/O). This procedure saves the double cost of copying the data into the DMA buffer, plus synchronizing with the LANai to get it to start the DMA. Since programmed I/O cannot take advantage of the faster burst transfer mode of the SBus, this solution improves latency at the expense of bandwidth. When a packet arrives at a node, the LANai moves it into host memory (DMA buffer) with a DMA transfer. This procedure ensures good bandwidth, quick delivery of the packet, and prompt draining of the network, even when the host program is busy or not available (i.e., descheduled). The packets are deposited into a receive queue in the DMA buffer, from which they will be dequeued and processed by the FM_extract() primitive.

FM uses a fixed packet format. This streamlines the queue management in the LANai control program. It also allows the overlapping of the protocol processing in the sender, network interface and receiver during the transfer of messages. A potential drawback is the increased complexity of the assemble/disassemble operations. Much of the work done on MPI-FM consisted precisely in dealing with this issue. The choice of the packet size was the result of a trade-off between contrasting benefits. A small packet size gives better latency, increases pipelining through the network, and gives potentially better network utilization. A larger size reduces the overheads, and thus gives better bandwidth. The current size of 128 bytes has been found to be the best compromise. Messages longer than 128 bytes are segmented and then reassembled within FM. Given its limited programming interface, the FM library is targeted to language and library developers rather than to end users. The range of applications being considered for development on FM, or already in the works, is testimony to the flexibility of its interface. Besides the MPI message passing library, these are the BSD socket interface, the Converse compiler back-end [12], the Tempest runtime library [11], and the Orca Project parallel object language [2].
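To make the programming model concrete, the following minimal sketch shows how a client of FM 1.1 might use the primitives of Table I. It is illustrative only: the header name "fm.h", the handler prototype, and the way a handler is named in a send call are assumptions on our part, not the actual FM declarations.

```c
/* Illustrative sketch of the FM 1.1 programming model (not actual FM code).
 * Assumed: a header "fm.h" declaring the Table I primitives, and a handler
 * prototype of void handler(void *buf, int size). */
#include <string.h>
#include "fm.h"            /* assumed to declare FM_send_4, FM_send, FM_extract */

static char last_msg[4096];
static int  last_len = 0;

/* User-defined handler: invoked on the receiving node by FM_extract()
 * for every message that names it. It must consume the carried data. */
static void msg_handler(void *buf, int size)
{
    last_len = size < (int)sizeof(last_msg) ? size : (int)sizeof(last_msg);
    memcpy(last_msg, buf, last_len);
}

/* Sending side: pick the short or the long primitive. */
static void deliver(int dest, const char *data, int nbytes)
{
    if (nbytes <= 16) {                       /* at most four words */
        int w[4] = {0, 0, 0, 0};
        memcpy(w, data, nbytes);
        FM_send_4(dest, msg_handler, w[0], w[1], w[2], w[3]);
    } else {
        FM_send(dest, msg_handler, (void *)data, nbytes);   /* segmented by FM */
    }
}

/* Receiving side: there is no explicit receive; the host periodically drains
 * the incoming queue, which runs the handlers of all pending messages. */
static void poll_network(void)
{
    FM_extract();
}
```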

MPI and MPICH

MPI (Message Passing Interface) is a message passing library, with primitive specifications for both C and Fortran. It is the result of the work of the MPI Forum, a committee composed of vendors and users started at the Supercomputing '92 conference with the aim of defining a message passing standard. MPI features a range of functionality, including point-to-point communication, with synchronous and asynchronous communication modes, and collective communication (barrier, broadcast, reduce, gather/scatter). MPI supports multicast and separate communication contexts by means of the communicator construct. A number of datatypes are predefined, and user-defined data types are supported as well. MPI comprises more than 120 functions. Such an intimidating size does not conflict with the possibility of a low overhead, high bandwidth implementation. In fact, this number of primitives is the result of the orthogonal combination of a small number of concepts, so that each primitive is actually capable of a rather streamlined implementation. The MPI standard has been gaining support in the parallel computing community since the presentation of the initial draft standard in 1993 and its formalization at the Message Passing Interface Forum [9]. One of MPI's largest attractions is the number of free implementations that have been made available. The quick and efficient realization of the MPI library on top of FM was made possible by the existence of one of these publicly available implementations.

Among the many MPI implementations, MPICH, developed by Argonne National Laboratory and Mississippi State University, appears to be one of the most popular. (The MPICH and other MPI implementations can be obtained at /mpi/index.html.) MPICH's portability derives from being built atop a restricted number of hardware-independent low level functions, collectively forming an Abstract Device Interface (ADI). Implementing the ADI functions is all that is required to run MPICH on a new platform. The ADI encapsulates the details and complexity of the underlying communication hardware into a separate module. By restricting the services provided to basic point-to-point message passing, it offers the minimum required to build a complete MPI implementation, resulting at the same time in a very general and portable interface. On top of the ADI, the remaining MPICH code implements the rest of the MPI standard, including the management of communicators, derived datatypes, and the implementation of collective operations using point-to-point operations. Table II summarizes the core message passing routines, representing the minimum set required to have a fully functional implementation. The ADI can be described as a virtual message passing interface with split-phase operations. To send or receive a message, a request is posted. A request is a structure containing all the relevant data: pointer to the message, message length, receiver/sender id, tag, and communicator. A second call is required to complete the send or receive operation. A number of ancillary functions are provided to cancel an ongoing communication, get the number of nodes in the network, and probe for the arrival of a specific message.
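The split-phase pattern can be summarized with a small sketch. The request layout and the MPID_* prototypes below are deliberately simplified assumptions made for illustration; the real MPICH ADI request object carries considerably more state.

```c
/* Simplified sketch of the split-phase ADI pattern (post, then complete),
 * in the spirit of Figure 1. The request layout and the MPID_* prototypes
 * are assumptions for illustration, not the actual MPICH declarations. */
typedef struct {
    void *buf;         /* pointer to the message */
    int   len;         /* message length */
    int   partner;     /* destination (send) or source (receive) rank */
    int   tag;
    int   context;     /* communicator context id */
    int   completed;
} ADI_Request;

extern void MPID_Post_send(ADI_Request *req);      /* starts a send     */
extern void MPID_Complete_send(ADI_Request *req);  /* blocks until done */
extern void MPID_Post_recv(ADI_Request *req);      /* starts a receive  */
extern void MPID_Complete_recv(ADI_Request *req);  /* blocks until done */

/* A blocking send, as the ADI expresses it: post the request, complete it. */
static void blocking_send(void *buf, int len, int dest, int tag, int ctx)
{
    ADI_Request req = { buf, len, dest, tag, ctx, 0 };
    MPID_Post_send(&req);
    MPID_Complete_send(&req);
}

/* A blocking receive follows the same two-phase pattern. */
static void blocking_recv(void *buf, int len, int src, int tag, int ctx)
{
    ADI_Request req = { buf, len, src, tag, ctx, 0 };
    MPID_Post_recv(&req);
    MPID_Complete_recv(&req);
}
```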

    Function                 Operation
    MPID_Cancel              Cancel a pending oper.
    MPID_Check_device        Checks for pending oper.
    MPID_Complete_send       Completes a send
    MPID_Complete_recv       Completes a receive
    MPID_End                 Terminates the ADI
    MPID_Init                Initializes the ADI
    MPID_Iprobe              Check if specific msg has arrived
    MPID_Myrank              Rank of calling process
    MPID_Mysize              Number of processes
    MPID_Post_send           Starts a send operation
    MPID_Post_send_ready     Starts a send, ready mode
    MPID_Post_send_sync      Starts a send, synchronous mode
    MPID_Post_recv           Starts a receive operation
    MPID_Test_send           Test for completion of a send
    MPID_Test_recv           Test for completion of a recv

    Table II: The ADI core routines

3 Related Work

A number of other research projects are focusing on the construction of an integrated high performance hardware-software communication subsystem for networks of workstations. U-Net [23] is built using FORE ATM interface cards. It presents an AAL5 programming interface to the applications. As in FM for Myrinet, the interface cards have a processor used to relieve the host processor of some tasks (for U-Net: segmentation and reassembly of messages, message demultiplexing, and DMA transfers to/from the host memory). One problem of ATM networks is the cost of the switch, much higher than the cost of a comparable Myrinet switch. Hamlyn [4] implements a sender-based memory management scheme, and gives applications direct access to the network interface. There are plans to implement it on a proprietary version of Myrinet for HP workstations. Cranium is in many respects similar to Hamlyn, and is to be implemented on an experimental interconnection network, Chaos. However, to our knowledge, at the time of this writing there are no high performance MPI implementations available on these interfaces.

As mentioned earlier, a number of libraries are already available to do parallel computing on networks of workstations. Libraries like PVM and MPI are typically built on top of the TCP/IP protocol found in virtually every workstation. The performance offered by TCP/IP, and consequently delivered to the libraries built on top of it, is one or two orders of magnitude worse than what is available on MPPs. The reason is essentially that both the protocol itself and its traditional implementation within the operating system were designed with objectives other than low latency and high bandwidth in mind. For instance, tolerance to high latency and high error rates, or protection among processes using the network, were primary design objectives.

Using a faster medium than Ethernet does not bring much improvement [15]. Some of the same libraries have been ported to MPPs. For example, two implementations of MPI are available on the SP2. One is a port of MPICH, the other (MPI-F) is a native implementation [10]. Both achieve 33 MB/s of peak bandwidth, with a 0-byte latency of 40.5 µs for MPI-F and 55 µs for the MPICH port. One feature of MPI-F is the packing/unpacking on the fly of complex data types, which exploits a pipe abstraction provided by the underlying communication layer. The MPI for the T3D has been produced by the Edinburgh Parallel Computing Centre (EPCC) in collaboration with Cray Research, Incorporated (CRI). It achieves a minimum latency of 43 µs; we measured a peak bandwidth of only 31.7 MB/s, despite a network which can provide up to 300 MB/s (however, we have not tried all the possible versions of MPI send and receive). MPI is available on the AP1000, where it achieves 332 µs minimum latency and 2.5 MB/s peak bandwidth [19].

4 Design

The approach followed in creating MPI-FM has been that of incrementally refining a straightforward implementation of the ADI. At each step, a different optimization was tested by modifying both the ADI and the FM interface as required, according to the results of the performance analysis. Thanks to this incremental process, it has been possible to precisely measure the impact of each modification. In this section, a general outline of the design is presented, followed by a description of the different versions of MPI-FM resulting from the modifications. In the next section the performance measurements for each version are presented.

4.1 General outline

Figure 1 shows the composition of MPI_Send and MPI_Recv in terms of the different layers' primitives. The MPID prefix denotes the ADI functions. FM_extract and the handler used for this type of receive, async_handler, are connected with a broken line because they are responsible for the advancement of the ADI receive routines, even though they are not called directly by them. Notice how in this particular case an ADI function is defined in terms of two others. It is often the case, both in the ADI and in the MPI layer above it, that a more complex function is defined in terms of simpler ones. For example, collective communication functions are usually implemented in terms of point-to-point primitives. The ADI primitives are semantically much closer to the MPI than to the FM interface. As a matter of fact, only a few lines of code are required in MPICH to define most of the point-to-point MPI functions in terms of ADI primitives. Consequently, much of the complexity is in the ADI-to-FM translation, which amounts to bridging a much wider semantic gap and two rather different programming models. The ADI-to-FM layer contributes the majority of the additional overhead, with the ADI-to-MPI layer contributing only on the order of a microsecond to the latency of a basic send-receive.

Figure 1: In MPI-FM the MPI primitives (MPI_Send and MPI_Recv in this figure) are built on top of the ADI routines (MPID prefix). The core ADI routines are in turn implemented using the FM primitives (FM prefix). (Layers shown: MPI_Send/MPI_Recv in the MPICH layer; MPID_Blocking_send, MPID_Blocking_recv, MPID_Post_send, MPID_Complete_send, MPID_Post_recv, MPID_Complete_recv in the ADI layer; FM_send and FM_extract/async_handler in the FM layer.)

We found the ADI to be quite well suited to matching a low level layer such as FM, due to its generality (no unnecessary restrictions built in) and flexibility (sufficient provisions to support a variety of devices). The key issue in the implementation of the ADI is the difference between the FM and ADI abstractions. As seen in Figure 1, there is not a one-to-one mapping between ADI and FM primitives. The two major differences between the two interfaces are (i) the ADI primitives are split phase ("start send", "complete send"), and (ii) FM does not have an explicit receive operation.

The split-phase nature of the ADI communication primitives implies that a state must be maintained for each pending operation. This is an important difference with FM, whose communication paradigm is stateless. To store the state of each request, two queues have been employed, one on the sender and one on the receiver side. Each element of the queues contains a description of the request and its current state. The two queues will be called send queue and receive queue in the following discussion. On the receive side, an additional queue (unexpected queue) is needed to store unexpected messages, i.e. incoming messages for which a receive request has not yet been posted. Each element of the queue stores a pointer to a temporary buffer, in addition to the description of the message. This queue and the associated buffering would not be needed if messages were kept waiting on the send side and then transferred upon posting of the receive request. But this kind of transfer-on-demand protocol would result in a higher average latency than the eager transfer protocol with buffering on the receive side, as employed by the ADI. In the latter scheme, a late receive would incur only the cost of a memory-to-memory copy, as opposed to the cost of a two way communication.

Instead of an explicit receive operation, FM relies on user-defined functions (handlers) to handle the content of the received messages. Incoming messages are enqueued in the host memory, and are extracted in turn by FM_extract, which executes the respective handler. Implementing the ADI receive primitive meant writing the appropriate handler and carefully placing FM_extract calls in the code. The task of the handlers is to copy the content of the incoming message into the destination buffer whose pointer is found in the receive request for that specific message. A temporary buffer is used in the case of an unexpected message. A different type of handler is required for synchronous operations, which is given the additional task of sending an acknowledgment message to the sender. The numerous calls to FM_extract guarantee that the messages are extracted, and their handlers executed, with the smallest possible delay.
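The receive path just described can be pictured with the following sketch. It is not the MPI-FM source: the structure layouts and the queue helpers are hypothetical, and only the overall flow (match against the posted receive queue, otherwise buffer on the unexpected queue) reflects the description above.

```c
/* Hypothetical sketch of the receive-side handler logic described above
 * (not the MPI-FM source). Structure layouts and the helpers recvq_match()
 * and unexpectedq_append() are invented for illustration. */
#include <stdlib.h>
#include <string.h>

typedef struct {                 /* header the ADI prepends to each message */
    int src, tag, context, length;
} adi_header_t;

typedef struct recv_req {        /* element of the posted receive queue */
    int   src, tag, context;
    void *user_buf;              /* destination buffer given to MPI_Recv */
    int   completed;
    struct recv_req *next;
} recv_req_t;

extern recv_req_t *recvq_match(int src, int tag, int context);
extern void        unexpectedq_append(const adi_header_t *h, void *copy);

/* Handler executed by FM_extract() for an incoming (short) message. */
static void async_handler(void *msg, int size)
{
    adi_header_t *h = (adi_header_t *)msg;
    char *payload   = (char *)msg + sizeof(adi_header_t);
    recv_req_t *req = recvq_match(h->src, h->tag, h->context);

    if (req != NULL) {
        /* A matching receive is posted: copy straight into the user buffer. */
        memcpy(req->user_buf, payload, h->length);
        req->completed = 1;
    } else {
        /* Unexpected message: keep a temporary copy until a receive arrives. */
        void *copy = malloc(h->length);
        memcpy(copy, payload, h->length);
        unexpectedq_append(h, copy);
    }
    (void)size;
}
```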

4.2 The four versions of MPI-FM

Two significant optimizations of the ADI and FM layers are analyzed in this paper. These are a gather mechanism on the send side, and an upcall function on the receive side. The four versions of MPI-FM described are (i) the base implementation, (ii) with the gather only, (iii) with the upcall only, and (iv) the final version with both optimizations.

MPI-FM base implementation

The latency and bandwidth curves of the first version of MPI-FM are shown in Figure 2. The 0-byte latency and the peak bandwidth of MPI-FM are 21 µs and 3.9 MB/s respectively, versus 14 µs and 17.6 MB/s for FM. While the base implementation is quite good in absolute terms, its distance from the FM curves shows that a good share of the FM performance was lost, and there was plenty of room for improvement.

Figure 2: Performance of the MPI-FM base implementation. (a) Latency; (b) bandwidth.

Through detailed measurements, the overhead was broken down into its components. As expected, the biggest components were memory-to-memory copies. These copies are employed in the straightforward implementation of some typical operations, like adding a header to a packet, found in almost every communication layer.

Therefore, exploring ways of eliminating them is of more general interest. Since the copies were being utilized for different purposes throughout the code, we had to come up with a different solution in each case, most of them involving modifications to FM. While these modifications are very specific to the implementation of MPI-FM, an upcoming release of FM will implement the same principles in a much more generalized fashion.

Optimization #1: MPI-FM with upcall

One of the memory-to-memory copy operations targeted for elimination was being used on the receive side in the reassembly of long messages. As shown in Figure 4 (middle), within FM long messages (> 104 bytes) are broken into packets on the send side, then put back together in a temporary buffer on the receive side. Upon completion of the reassembly, the application-specific handler specified in the message is called. The handler usually copies the message content to its final destination within the application. For example, in the case of the ADI implementation, the final destination is the buffer that the MPI user specified in the receive operation (or a temporary buffer, if there is no pending receive). The reassembly of long messages before they are handed out to the application handler represents an additional copy with respect to the case of short messages. Figure 3 shows the FM performance degradation resulting from this additional copy. The difference in the two measurements is in the action performed by the application handler: in one case the message is not touched once reassembled, in the other case it is copied out to a destination buffer.

Figure 3: FM: performance impact of copying the data out of the reassembly buffer on the receive side. (a) Latency; (b) bandwidth.

Figure 4: Segmentation & reassembly in MPI-FM. The message path runs from the MPI_Send buffer (MPI-FM) through the LANai + network and the DMA region (FM), possibly through a reassembly buffer (FM), to the MPI_Recv buffer (MPI-FM); the three cases shown are small messages (< 104 bytes), segmentation & reassembly, and segmentation & reassembly with upcall.

The ideal solution would be to reassemble the message directly in the destination buffer (Figure 4, right). Let's look at this possibility from the point of view of the implementor of FM (or of any similar low level messaging layer). There are two obstacles to overcome: (i) the destination buffer for the incoming message is generally known only to the application code running on top of FM (the ADI in the case of MPI-FM), and (ii) only the message body, i.e. the payload, must go into the destination buffer, not the headers added by FM or the ADI. The location of the destination buffer is in general only known by the application. Furthermore, the application needs to know the identity of the message in order to choose the appropriate buffer. In other words, the knowledge about what buffer to use is constrained both in space (only the application code on the receiver side knows) and in time (the code knows only after the message has arrived). For example, the ADI needs to know the source, tag and communicator of an incoming message to tell which (if any) receive request has been posted for it. Rather than trying to relax these constraints, for example by duplicating on the send side the knowledge about which buffer to use, we adopted an inquiry mechanism employing an upcall [6]. An upcall is a function defined inside an upper layer and invoked by the layer underneath. Whenever the first fragment of a long message arrives, FM asks the application code for a buffer to use as reassembly buffer. This is accomplished by invoking the upcall with the fragment as an argument. The upcall return value, if not null, is assumed to be a pointer to a buffer; if null, FM provides its own temporary buffer. The passed fragment contains sufficient information to determine the identity of the incoming message because it is the head of the message, where the application code puts a header containing the message attributes. (The fragment also contains the header that FM adds to each fragment, but since the FM header is actually appended to the fragment, the user code can safely ignore it.) To allow the application to drop the header of the incoming message before copying it into the destination buffer, the reassembly mechanism of FM has been slightly modified to allow the upcall to increment the pointer to the fragment passed as argument. Once the function has read the information needed for message identification, it can advance the pointer to the start of the payload.

FM will then start copying the fragment into the reassembly buffer starting from the updated position. As mentioned earlier, this mechanism is used only for long messages. Messages no longer than a packet are not segmented and don't require reassembly; thus FM is not involved in pulling them out of the incoming messages queue. That is, in the case of short messages, there are no unnecessary copies to be eliminated.

Another consideration about the upcall is that it exposes the issue of upward/downward interfacing between contiguous layers. A common view is that in a protocol stack a layer must offer a number of services to the one immediately above it, which in turn uses them to implement more complex functionality. A less explored aspect of this interaction is the opposite relationship, that is, the services an upper layer can usefully offer to the one below. It is sometimes the case that a certain piece of information needed by a low level layer is available only in an upper layer. In the MPI-FM case, the information about the buffer to use, needed by FM, is available in the ADI. The ADI in turn first needs to know the identity of the message before it can choose the corresponding buffer. The problem with the traditional layered and rigidly hierarchical model is that it is too restrictive on how functionality can be distributed across the layers. For example, building a rigidly hierarchical MPI-FM would require putting the buffering in FM, rather than in the ADI, so as to avoid the bottom-up exchange of information. Conversely, the adoption of the upcall adds one degree of flexibility, by allowing the low level (FM) to query the upper level (ADI).
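A sketch of what such an upcall might look like on the ADI side is given below. The signature (FM passes a pointer to the fragment pointer, and a non-null return value is taken as the reassembly buffer) is modelled on the description above but is not the actual FM prototype; the header and queue types reuse the hypothetical ones from the earlier receive-side sketch.

```c
/* Hypothetical upcall on the ADI side, modelled on the mechanism described
 * above (not the actual FM/MPI-FM interface). adi_header_t, recv_req_t and
 * recvq_match() are the invented types/helpers from the earlier sketch. */
static void *reassembly_upcall(char **fragment)
{
    adi_header_t *h = (adi_header_t *)(*fragment);
    recv_req_t *req = recvq_match(h->src, h->tag, h->context);

    /* Drop the ADI header: advance the fragment pointer so that FM copies
     * only the payload into the reassembly buffer. */
    *fragment += sizeof(adi_header_t);

    if (req != NULL)
        return req->user_buf;   /* reassemble directly in the MPI_Recv buffer */
    return NULL;                /* let FM fall back to its own temporary buffer */
}
```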
Optimization #2: MPI-FM with gather

The other copy targeted for elimination was the one used on the send side in assembling packets. Typically the code running on top of FM adds its own header to the data before sending it to another node. This header contains information about the message being transmitted, so that the code on the receiving node can identify it. Once utilized, the header can be disposed of and the data retrieved from the message. In this process, the crucial steps are adding the header ahead of a given data buffer, and getting rid of the given message's header. This is a special case of the more general problem of assembling a packet out of several scattered pieces and later disassembling it into pieces at different memory locations. The simple minded approach is to use an intermediate buffer, to copy in the pieces on the send side, or to copy them out on the receive side, at the cost of an additional copy. Figure 5 shows the performance degradation at the FM level due to the additional copy. The same number of bytes is being transmitted in the two tests. But in one case they are from a single contiguous memory area, in the other they are from two separate buffers (the first containing 24 bytes, the second containing the rest) and must first be copied into a contiguous buffer.

This issue arises from the fact that FM, like many other low level messaging layers, deals only with messages that are contiguous in memory. For FM, however, this restriction can be lifted without major changes to its structure. When the FM send primitive is invoked, the message is copied to the network interface a double word at a time in sequence. Since each double word could be copied from an arbitrary location instead, a contiguous message is not strictly required. Thus, FM_send has been modified to accept a two-part message. The first part fits the header size used by the ADI (24 bytes); the second part can be of arbitrary size. The new primitive, FM_gather24, performs a gather on the fly by copying all 24 bytes from the first segment, and all the remaining data from the second segment, directly into the network interface.

Figure 5: FM: performance impact of assembling a contiguous message on the send side. (a) Latency; (b) bandwidth.

When the message is bigger than a packet, the second segment is fragmented as it would be in FM_send. Despite being quite a primitive gather facility, FM_gather24 proved to be sufficient to substantially reduce the overhead of the send side.
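The effect of the gather on the send path can be sketched as follows. The FM_gather24() signature shown here is an assumption based on the description above (a 24-byte first segment plus an arbitrary second segment); the staging-buffer variant mirrors the extra copy that the base implementation had to perform.

```c
/* Sketch contrasting the two send paths discussed above (illustrative only;
 * the FM_gather24 signature is assumed from the text, not taken from FM). */
#include <assert.h>
#include <string.h>

#define ADI_HDR 24                 /* header size used by the ADI */
#define MAX_MSG 16384              /* illustrative staging-buffer bound */

typedef void (*fm_handler_t)(void *buf, int size);
extern void FM_send(int dest, fm_handler_t h, void *buf, int size);
extern void FM_gather24(int dest, fm_handler_t h,
                        const void *hdr24, const void *payload, int size);

/* Base version: header and payload are first copied into one contiguous
 * buffer, because FM_send only accepts contiguous messages. */
static void send_with_copy(int dest, fm_handler_t h,
                           const char hdr[ADI_HDR], const char *payload, int n)
{
    static char staging[ADI_HDR + MAX_MSG];
    assert(n <= MAX_MSG);
    memcpy(staging, hdr, ADI_HDR);
    memcpy(staging + ADI_HDR, payload, n);      /* the extra copy */
    FM_send(dest, h, staging, ADI_HDR + n);
}

/* Optimized version: the two pieces are handed to FM separately and gathered
 * on the fly while being written into the interface memory. */
static void send_with_gather(int dest, fm_handler_t h,
                             const char hdr[ADI_HDR], const char *payload, int n)
{
    FM_gather24(dest, h, hdr, payload, n);
}
```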

Since FM does not have an explicit receive operation with which to build the dual operation, scatter, on the receive side, the two-part scatter had to be implemented elsewhere. The pointer update feature built into the upcall function, although limited to dropping the header, has conveniently substituted for a full-fledged scatter in this context.

The assembly and disassembly of packets is a common function performed by messaging layers. In its simplest form, a header needs to be added in front of a payload, containing information for the corresponding layer on the receiving node. On the receiver side the opposite operation is performed. The ADI, and the underlying FM, are not the only example of this kind of packet manipulation; most of the layers in the ISO and ATM hierarchies perform the same type of operation. A two piece assembly/disassembly is the simplest case of the general gather/scatter functionality which allows a message to be built out of a number of pieces. The need for a bigger number of components is commonly associated with the packing/unpacking of messages containing noncontiguous data types. A more general form of gather/scatter is being considered for implementation in FM; it could be usefully employed in the assembly/disassembly of messages containing MPI derived data types. It is worth noting once more that the most common form of gather is also the simplest, and it can already make a big difference. Whether involving two pieces or more, the crucial issue for the gather/scatter is avoiding additional copies. Copies can be avoided if the capability of accepting messages in multiple parts is preserved all the way down the layer hierarchy to the lowest one, where the actual copying of the data into the network interface memory takes place. Restoring this feature in FM was not only possible, even if in a limited form, but also inexpensive in terms of overhead.

More in general, achieving an effective division of tasks across layers involves choosing the best place in which to implement the services offered. Some services, like reliable delivery, are considered essential in most applications, and thus have to be implemented at some level in the layer stack. Adding a new service has a cost, which can vary widely depending on which layer is chosen for the implementation, the facilities already available in it, and other contingencies. For example, implementing the two-piece gather feature in FM rather than in the ADI improved the overall MPI-FM performance. Since FM is copying the message into the network interface a double word at a time, it does not really make much of a difference if the message is in two pieces rather than in one contiguous block. Instead, implementing the gather within the ADI in the initial version of MPI-FM exacted the cost of the additional copy of the message pieces into a temporary buffer.

Final version: MPI-FM with upcall and gather

The upcall and the gather are orthogonal techniques, and can be used together. Each reduces the processing overhead on its side of the communication: the gather on the send side, and the upcall on the receive side. Thus, their effects can potentially be cumulative.

5 Measurements and analysis

The following sections describe the effects of the gather and the upcall on MPI-FM. The two modifications will be presented separately first, and then combined in what is the final version of MPI-FM. The performance of each version will be contrasted with that of FM and that of the base version, which represent the performance target and baseline respectively.

5.1 Methodology

The network used for the measurements presented in the following is composed of two SPARCstation 20/71's running Solaris 2.4 and connected with a Myrinet network. The network interfaces use the new version of the LANai chip (LANai 3.2), which is a beta version of the LANai 4.0. The compiler used is gcc. The LANai compiler and the Myricom software distribution employed are version 3.0. All measurements have been performed using the timer on the network interfaces (through the MPI_Wtime function), which offers microsecond accuracy. Each measurement is repeated a number of times, and the median time is taken, to filter out spurious results due to process descheduling, CPU load surges, or other factors. Measurements were performed in multiuser mode, but at times when no other users were using the machines and no big jobs (i.e., with %CPU > 1 as shown by the ps -aux command) were being run. In the measurements, the size of the messages refers to the payload only (i.e., the size does not include the header). The latency test is an MPI program which measures the time to send a sequence of messages, every time waiting for the message to be sent back. The time is taken for a sequence long enough to reduce the timer granularity error to acceptable levels.

The communication primitives used are MPI_Send and MPI_Recv. The bandwidth test is an MPI program which measures the time to send a sequence of messages back to back from one node to another. The clock is stopped when the last message is acknowledged. The communication primitives used are MPI_Send, MPI_Irecv and MPI_Waitall. The versions of the send and receive operations used in the tests are the most commonly used, and thus are at the highest premium to be efficient and robust. Because of the structure of the MPICH code, their performance is a good indicator of the performance of the other MPI communication versions in MPICH, as most of the other primitives use the same ADI primitives employed by the basic send and receive. This relationship may not be the same in implementations of MPI which exploit special underlying hardware features for collective communication. In all of the tests performed, the timing was such that the receives happened to always be posted in advance, even without explicit pre-posting or use of barriers. This situation has been explicitly verified, both by instrumenting the code and by writing specific MPI tests.
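For reference, a minimal ping-pong program in the style of the latency test just described might look as follows. This is an illustrative reconstruction, not the authors' benchmark code; the message size and repetition count are arbitrary, and the paper's methodology additionally repeats each measurement and takes the median. The bandwidth test differs in that one node streams messages back to back while the other pre-posts MPI_Irecv's and waits on them with MPI_Waitall.

```c
/* Illustrative MPI ping-pong in the style of the latency test described
 * above (not the authors' code); REPS and SIZE are arbitrary choices. */
#include <stdio.h>
#include <mpi.h>

#define REPS 1000
#define SIZE 1024

int main(int argc, char **argv)
{
    char buf[SIZE] = {0};
    int rank, i;
    MPI_Status st;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)      /* one-way latency = round-trip time / 2 */
        printf("%d bytes: %.1f us one-way\n", SIZE,
               (t1 - t0) / REPS / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}
```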

5.2 The four versions

MPI-FM base version performance

The comparison of the latency and bandwidth curves of FM with those of the first version of MPI-FM was presented in Figure 2. The 0-byte latency and the peak bandwidth of MPI-FM are 21 µs and 3.9 MB/s respectively, against 14 µs and 17.6 MB/s for FM. Most of the FM performance is lost in this version. The difference in latency, that is, the additional overhead contributed by MPI-FM, grows rapidly with message size. At 256 bytes, the latency of MPI-FM is already one order of magnitude higher than that of FM. The peak bandwidth is reduced to just one fourth of the FM peak value. Since the additional overhead keeps increasing with size, it cannot be amortized even with big messages. To facilitate the evaluation of the improvement introduced by each optimization, these curves will be reported in the next graphs as the starting point and the target respectively.

Optimization #1: MPI-FM with upcall

The information regarding the appropriate buffer to use is available only to the ADI code, which needs to know the identity of the incoming message before it can decide which one is the right buffer to give to FM. For this reason, the ADI is interrogated by FM right after the arrival of the first packet of the message, which contains the header and thus the identity of the message itself. The upcall function is also given the task of removing the header once it has utilized its content. Figure 6 shows the latency and the bandwidth of MPI-FM with the addition of the upcall. Both latency and bandwidth present a noticeable improvement for message sizes greater than or equal to 128 bytes.

Figure 6: Performance impact of the upcall in MPI-FM. (a) Latency; (b) bandwidth.

Optimization #2: MPI-FM with gather

Figure 7 shows the latency and the bandwidth when the simple gather mechanism is added to FM. As described earlier, the new version of FM_send() called FM_gather24() accepts a message composed of two parts, a header and a payload. The header has a fixed length of 24 bytes (to suit the needs of the ADI); the payload can have arbitrary length. The use of FM_gather24() removes the need for the additional copy on the send side, with a visible benefit for both latency and bandwidth.

Figure 7: Performance impact of the gather in MPI-FM. (a) Latency; (b) bandwidth.

Final version: MPI-FM with upcall and gather

A global view of the different contributions to performance is shown in Figure 8, in which the graphs of all four different versions of MPI-FM are reported. Both optimizations, when individually applied, contribute to improve latency and bandwidth. However, it is only when the sender and receiver are both optimized that the full performance of MPI-FM can be achieved.

Figure 8: Comparison of the different versions of MPI-FM. (a) Latency; (b) bandwidth.

The impact of the joint application of gather and upcall is more than the sum of the individual contributions. The reason has to do with the pipelining of consecutive messages, and is best explained with the help of Figure 9. In this figure the time taken to transfer a message is decomposed into three components, one for each of the processors involved: sender processor cycles (sender overhead), LANai cycles on both interfaces plus network latency, and receiver processor cycles (receiver overhead). When consecutive messages are allowed to overlap, two cases are possible. In the first case, the sender can start sending a new message as soon as it is done with the previous one, while the receiver is periodically idle, waiting for the arrival of the next one. This is the case of the MPI-FM version with upcall; the bandwidth of the system is still limited by the sender. In the other case, the receiver is slower. In this situation the buffers eventually fill up, and the sender is slowed down by the back pressure exerted by the network.

Figure 9: The pipelining of messages in the bandwidth test. (Panels: sender limited bandwidth vs. receiver limited bandwidth; components: sender processor overhead, LANai + network, receiver processor overhead.)

A new message will be sent only after the receiver has pulled a prior one out of the network on the other side. This is the version with gather; the system bandwidth is receiver limited. With only one optimization at work, the other unoptimized side of the communication becomes the bottleneck. Both are needed at the same time to achieve the full potential of the network. A more general observation is that a message being transferred from one node to another is actually being pipelined through a series of stages. The stages of the pipeline must be lean and tightly coupled for the sake of latency, and well balanced for the sake of bandwidth. In MPI-FM the pipeline is well balanced, and thus performance is good, over all message sizes. This issue is complicated by the fact that different subsystems of a network of workstations are involved. Processors (both on the host and on the network interface), memory hierarchies, I/O buses and network links are the results of different technologies, which evolve at different speeds. For example, I/O buses change much more slowly among successive generations of workstations than processors or memory architectures. This means that a balanced pipeline on one workstation will not necessarily be so on the next generation. Even differences in the memory configuration within the same generation can make a big difference.

The final judgment on how successful the tuning of MPI-FM has been can be made by comparing the curves of the final version to those of FM. Accounting for a left translation due to the inclusion of the ADI header in every message, the MPI-FM curve substantially reproduces the shape of the FM curve (compare the bumps corresponding to the start of the segmentation and reassembly). The additional MPI-FM overhead is now limited to an approximately constant value of around 6 µs. The MPICH/ADI layer is sufficiently "thin" and well matched to the FM interface that it does not contribute a significant amount of overhead. This is in line with the objective of the MPI-FM implementation, of having the biggest possible share of the FM communication performance delivered to the application. The difference between latencies is about 6 µs for short messages. The peak bandwidth of 17.3 MB/s closely approaches the 17.6 MB/s of FM.

5.3 MPI-FM performance in perspective

To put the results achieved with our version of MPI in the right perspective, we compare its performance to that of MPI libraries running on two mainstream MPPs, the IBM SP2 and the Cray T3D.


Memory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Memory hierarchy J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Computer Architecture ARCOS Group Computer Science and Engineering Department University Carlos III of Madrid

More information

Welcome to Part 3: Memory Systems and I/O

Welcome to Part 3: Memory Systems and I/O Welcome to Part 3: Memory Systems and I/O We ve already seen how to make a fast processor. How can we supply the CPU with enough data to keep it busy? We will now focus on memory issues, which are frequently

More information

Plot SIZE. How will execution time grow with SIZE? Actual Data. int array[size]; int A = 0;

Plot SIZE. How will execution time grow with SIZE? Actual Data. int array[size]; int A = 0; How will execution time grow with SIZE? int array[size]; int A = ; for (int i = ; i < ; i++) { for (int j = ; j < SIZE ; j++) { A += array[j]; } TIME } Plot SIZE Actual Data 45 4 5 5 Series 5 5 4 6 8 Memory

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

Disks and I/O Hakan Uraz - File Organization 1

Disks and I/O Hakan Uraz - File Organization 1 Disks and I/O 2006 Hakan Uraz - File Organization 1 Disk Drive 2006 Hakan Uraz - File Organization 2 Tracks and Sectors on Disk Surface 2006 Hakan Uraz - File Organization 3 A Set of Cylinders on Disk

More information

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel Chapter-6 SUBJECT:- Operating System TOPICS:- I/O Management Created by : - Sanjay Patel Disk Scheduling Algorithm 1) First-In-First-Out (FIFO) 2) Shortest Service Time First (SSTF) 3) SCAN 4) Circular-SCAN

More information

Parallel and High Performance Computing CSE 745

Parallel and High Performance Computing CSE 745 Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel

More information

A First Implementation of In-Transit Buffers on Myrinet GM Software Λ

A First Implementation of In-Transit Buffers on Myrinet GM Software Λ A First Implementation of In-Transit Buffers on Myrinet GM Software Λ S. Coll, J. Flich, M. P. Malumbres, P. López, J. Duato and F.J. Mora Universidad Politécnica de Valencia Camino de Vera, 14, 46071

More information

Introduction to Operating Systems. Chapter Chapter

Introduction to Operating Systems. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

Network. Department of Statistics. University of California, Berkeley. January, Abstract

Network. Department of Statistics. University of California, Berkeley. January, Abstract Parallelizing CART Using a Workstation Network Phil Spector Leo Breiman Department of Statistics University of California, Berkeley January, 1995 Abstract The CART (Classication and Regression Trees) program,

More information

Chapter 2: Computer-System Structures. Hmm this looks like a Computer System?

Chapter 2: Computer-System Structures. Hmm this looks like a Computer System? Chapter 2: Computer-System Structures Lab 1 is available online Last lecture: why study operating systems? Purpose of this lecture: general knowledge of the structure of a computer system and understanding

More information

Uniprocessor Computer Architecture Example: Cray T3E

Uniprocessor Computer Architecture Example: Cray T3E Chapter 2: Computer-System Structures MP Example: Intel Pentium Pro Quad Lab 1 is available online Last lecture: why study operating systems? Purpose of this lecture: general knowledge of the structure

More information

Introduction to Operating. Chapter Chapter

Introduction to Operating. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

Lecture 1: January 22

Lecture 1: January 22 CMPSCI 677 Distributed and Operating Systems Spring 2018 Lecture 1: January 22 Lecturer: Prashant Shenoy Scribe: Bin Wang 1.1 Introduction to the course The lecture started by outlining the administrative

More information

Utilizing Linux Kernel Components in K42 K42 Team modified October 2001

Utilizing Linux Kernel Components in K42 K42 Team modified October 2001 K42 Team modified October 2001 This paper discusses how K42 uses Linux-kernel components to support a wide range of hardware, a full-featured TCP/IP stack and Linux file-systems. An examination of the

More information

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 21: Network Protocols (and 2 Phase Commit)

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 21: Network Protocols (and 2 Phase Commit) CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring 2003 Lecture 21: Network Protocols (and 2 Phase Commit) 21.0 Main Point Protocol: agreement between two parties as to

More information

CS610- Computer Network Solved Subjective From Midterm Papers

CS610- Computer Network Solved Subjective From Midterm Papers Solved Subjective From Midterm Papers May 08,2012 MC100401285 Moaaz.pk@gmail.com Mc100401285@gmail.com PSMD01 CS610- Computer Network Midterm Examination - Fall 2011 1. Where are destination and source

More information

4. Hardware Platform: Real-Time Requirements

4. Hardware Platform: Real-Time Requirements 4. Hardware Platform: Real-Time Requirements Contents: 4.1 Evolution of Microprocessor Architecture 4.2 Performance-Increasing Concepts 4.3 Influences on System Architecture 4.4 A Real-Time Hardware Architecture

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Chapter 4. Routers with Tiny Buffers: Experiments. 4.1 Testbed experiments Setup

Chapter 4. Routers with Tiny Buffers: Experiments. 4.1 Testbed experiments Setup Chapter 4 Routers with Tiny Buffers: Experiments This chapter describes two sets of experiments with tiny buffers in networks: one in a testbed and the other in a real network over the Internet2 1 backbone.

More information

Lecture 16. Today: Start looking into memory hierarchy Cache$! Yay!

Lecture 16. Today: Start looking into memory hierarchy Cache$! Yay! Lecture 16 Today: Start looking into memory hierarchy Cache$! Yay! Note: There are no slides labeled Lecture 15. Nothing omitted, just that the numbering got out of sequence somewhere along the way. 1

More information

ECE 650 Systems Programming & Engineering. Spring 2018

ECE 650 Systems Programming & Engineering. Spring 2018 ECE 650 Systems Programming & Engineering Spring 2018 Networking Transport Layer Tyler Bletsch Duke University Slides are adapted from Brian Rogers (Duke) TCP/IP Model 2 Transport Layer Problem solved:

More information

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu

More information

Virtual Memory Outline

Virtual Memory Outline Virtual Memory Outline Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory Other Considerations Operating-System Examples

More information

Chapter 9: Virtual-Memory

Chapter 9: Virtual-Memory Chapter 9: Virtual-Memory Management Chapter 9: Virtual-Memory Management Background Demand Paging Page Replacement Allocation of Frames Thrashing Other Considerations Silberschatz, Galvin and Gagne 2013

More information

Notes based on prof. Morris's lecture on scheduling (6.824, fall'02).

Notes based on prof. Morris's lecture on scheduling (6.824, fall'02). Scheduling Required reading: Eliminating receive livelock Notes based on prof. Morris's lecture on scheduling (6.824, fall'02). Overview What is scheduling? The OS policies and mechanisms to allocates

More information

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on

More information

4. Networks. in parallel computers. Advances in Computer Architecture

4. Networks. in parallel computers. Advances in Computer Architecture 4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors

More information

A Freely Congurable Audio-Mixing Engine. M. Rosenthal, M. Klebl, A. Gunzinger, G. Troster

A Freely Congurable Audio-Mixing Engine. M. Rosenthal, M. Klebl, A. Gunzinger, G. Troster A Freely Congurable Audio-Mixing Engine with Automatic Loadbalancing M. Rosenthal, M. Klebl, A. Gunzinger, G. Troster Electronics Laboratory, Swiss Federal Institute of Technology CH-8092 Zurich, Switzerland

More information

Performance Comparison Between AAL1, AAL2 and AAL5

Performance Comparison Between AAL1, AAL2 and AAL5 The University of Kansas Technical Report Performance Comparison Between AAL1, AAL2 and AAL5 Raghushankar R. Vatte and David W. Petr ITTC-FY1998-TR-13110-03 March 1998 Project Sponsor: Sprint Corporation

More information

BlueGene/L. Computer Science, University of Warwick. Source: IBM

BlueGene/L. Computer Science, University of Warwick. Source: IBM BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours

More information

1/5/2012. Overview of Interconnects. Presentation Outline. Myrinet and Quadrics. Interconnects. Switch-Based Interconnects

1/5/2012. Overview of Interconnects. Presentation Outline. Myrinet and Quadrics. Interconnects. Switch-Based Interconnects Overview of Interconnects Myrinet and Quadrics Leading Modern Interconnects Presentation Outline General Concepts of Interconnects Myrinet Latest Products Quadrics Latest Release Our Research Interconnects

More information

Operating Systems Memory Management. Mathieu Delalandre University of Tours, Tours city, France

Operating Systems Memory Management. Mathieu Delalandre University of Tours, Tours city, France Operating Systems Memory Management Mathieu Delalandre University of Tours, Tours city, France mathieu.delalandre@univ-tours.fr 1 Operating Systems Memory Management 1. Introduction 2. Contiguous memory

More information

LAPI on HPS Evaluating Federation

LAPI on HPS Evaluating Federation LAPI on HPS Evaluating Federation Adrian Jackson August 23, 2004 Abstract LAPI is an IBM-specific communication library that performs single-sided operation. This library was well profiled on Phase 1 of

More information

Group Management Schemes for Implementing MPI Collective Communication over IP Multicast

Group Management Schemes for Implementing MPI Collective Communication over IP Multicast Group Management Schemes for Implementing MPI Collective Communication over IP Multicast Xin Yuan Scott Daniels Ahmad Faraj Amit Karwande Department of Computer Science, Florida State University, Tallahassee,

More information

RICE UNIVERSITY. High Performance MPI Libraries for Ethernet. Supratik Majumder

RICE UNIVERSITY. High Performance MPI Libraries for Ethernet. Supratik Majumder RICE UNIVERSITY High Performance MPI Libraries for Ethernet by Supratik Majumder A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree MASTER OF SCIENCE Approved, Thesis Committee:

More information

HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS

HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS CS6410 Moontae Lee (Nov 20, 2014) Part 1 Overview 00 Background User-level Networking (U-Net) Remote Direct Memory Access

More information

Lecture 1: January 23

Lecture 1: January 23 CMPSCI 677 Distributed and Operating Systems Spring 2019 Lecture 1: January 23 Lecturer: Prashant Shenoy Scribe: Jonathan Westin (2019), Bin Wang (2018) 1.1 Introduction to the course The lecture started

More information

FB(9,3) Figure 1(a). A 4-by-4 Benes network. Figure 1(b). An FB(4, 2) network. Figure 2. An FB(27, 3) network

FB(9,3) Figure 1(a). A 4-by-4 Benes network. Figure 1(b). An FB(4, 2) network. Figure 2. An FB(27, 3) network Congestion-free Routing of Streaming Multimedia Content in BMIN-based Parallel Systems Harish Sethu Department of Electrical and Computer Engineering Drexel University Philadelphia, PA 19104, USA sethu@ece.drexel.edu

More information

CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters

CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters Amit Karwande, Xin Yuan Dept. of Computer Science Florida State University Tallahassee, FL 32306 {karwande,xyuan}@cs.fsu.edu

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Spring 2018 L17 Main Memory Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 FAQ Was Great Dijkstra a magician?

More information

Introduction to Input and Output

Introduction to Input and Output Introduction to Input and Output The I/O subsystem provides the mechanism for communication between the CPU and the outside world (I/O devices). Design factors: I/O device characteristics (input, output,

More information

Chapter 9: Virtual Memory

Chapter 9: Virtual Memory Chapter 9: Virtual Memory Silberschatz, Galvin and Gagne 2013 Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

I/O Buffering and Streaming

I/O Buffering and Streaming I/O Buffering and Streaming I/O Buffering and Caching I/O accesses are reads or writes (e.g., to files) Application access is arbitary (offset, len) Convert accesses to read/write of fixed-size blocks

More information

DISTRIBUTED EMBEDDED ARCHITECTURES

DISTRIBUTED EMBEDDED ARCHITECTURES DISTRIBUTED EMBEDDED ARCHITECTURES A distributed embedded system can be organized in many different ways, but its basic units are the Processing Elements (PE) and the network as illustrated in Figure.

More information

440GX Application Note

440GX Application Note Overview of TCP/IP Acceleration Hardware January 22, 2008 Introduction Modern interconnect technology offers Gigabit/second (Gb/s) speed that has shifted the bottleneck in communication from the physical

More information

Process 0 Process 1 MPI_Barrier MPI_Isend. MPI_Barrier. MPI_Recv. MPI_Wait. MPI_Isend message. header. MPI_Recv. buffer. message.

Process 0 Process 1 MPI_Barrier MPI_Isend. MPI_Barrier. MPI_Recv. MPI_Wait. MPI_Isend message. header. MPI_Recv. buffer. message. Where's the Overlap? An Analysis of Popular MPI Implementations J.B. White III and S.W. Bova Abstract The MPI 1:1 denition includes routines for nonblocking point-to-point communication that are intended

More information

TR-CS The rsync algorithm. Andrew Tridgell and Paul Mackerras. June 1996

TR-CS The rsync algorithm. Andrew Tridgell and Paul Mackerras. June 1996 TR-CS-96-05 The rsync algorithm Andrew Tridgell and Paul Mackerras June 1996 Joint Computer Science Technical Report Series Department of Computer Science Faculty of Engineering and Information Technology

More information

Multiprocessor scheduling

Multiprocessor scheduling Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.

More information

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [NETWORKING] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Why not spawn processes

More information

Lecture 2: September 9

Lecture 2: September 9 CMPSCI 377 Operating Systems Fall 2010 Lecture 2: September 9 Lecturer: Prashant Shenoy TA: Antony Partensky & Tim Wood 2.1 OS & Computer Architecture The operating system is the interface between a user

More information

req unit unit unit ack unit unit ack

req unit unit unit ack unit unit ack The Design and Implementation of ZCRP Zero Copying Reliable Protocol Mikkel Christiansen Jesper Langfeldt Hagen Brian Nielsen Arne Skou Kristian Qvistgaard Skov August 24, 1998 1 Design 1.1 Service specication

More information

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks Ryan G. Lane Daniels Scott Xin Yuan Department of Computer Science Florida State University Tallahassee, FL 32306 {ryanlane,sdaniels,xyuan}@cs.fsu.edu

More information

Novel Intelligent I/O Architecture Eliminating the Bus Bottleneck

Novel Intelligent I/O Architecture Eliminating the Bus Bottleneck Novel Intelligent I/O Architecture Eliminating the Bus Bottleneck Volker Lindenstruth; lindenstruth@computer.org The continued increase in Internet throughput and the emergence of broadband access networks

More information

CSE398: Network Systems Design

CSE398: Network Systems Design CSE398: Network Systems Design Instructor: Dr. Liang Cheng Department of Computer Science and Engineering P.C. Rossin College of Engineering & Applied Science Lehigh University February 23, 2005 Outline

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.

More information

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide

More information

1 Introduction Myrinet grew from the results of two ARPA-sponsored projects. Caltech's Mosaic and the USC Information Sciences Institute (USC/ISI) ATO

1 Introduction Myrinet grew from the results of two ARPA-sponsored projects. Caltech's Mosaic and the USC Information Sciences Institute (USC/ISI) ATO An Overview of Myrinet Ralph Zajac Rochester Institute of Technology Dept. of Computer Engineering EECC 756 Multiple Processor Systems Dr. M. Shaaban 5/18/99 Abstract The connections between the processing

More information

Lecture 28: Networks & Interconnect Architectural Issues Professor Randy H. Katz Computer Science 252 Spring 1996

Lecture 28: Networks & Interconnect Architectural Issues Professor Randy H. Katz Computer Science 252 Spring 1996 Lecture 28: Networks & Interconnect Architectural Issues Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Review: ABCs of Networks Starting Point: Send bits between 2 computers Queue

More information

MPI and comparison of models Lecture 23, cs262a. Ion Stoica & Ali Ghodsi UC Berkeley April 16, 2018

MPI and comparison of models Lecture 23, cs262a. Ion Stoica & Ali Ghodsi UC Berkeley April 16, 2018 MPI and comparison of models Lecture 23, cs262a Ion Stoica & Ali Ghodsi UC Berkeley April 16, 2018 MPI MPI - Message Passing Interface Library standard defined by a committee of vendors, implementers,

More information

An Analysis of Object Orientated Methodologies in a Parallel Computing Environment

An Analysis of Object Orientated Methodologies in a Parallel Computing Environment An Analysis of Object Orientated Methodologies in a Parallel Computing Environment Travis Frisinger Computer Science Department University of Wisconsin-Eau Claire Eau Claire, WI 54702 frisintm@uwec.edu

More information

Xinu on the Transputer

Xinu on the Transputer Purdue University Purdue e-pubs Department of Computer Science Technical Reports Department of Computer Science 1990 Xinu on the Transputer Douglas E. Comer Purdue University, comer@cs.purdue.edu Victor

More information

Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies. Admin

Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies. Admin Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies Alvin R. Lebeck CPS 220 Admin Homework #5 Due Dec 3 Projects Final (yes it will be cumulative) CPS 220 2 1 Review: Terms Network characterized

More information

Accelerated Library Framework for Hybrid-x86

Accelerated Library Framework for Hybrid-x86 Software Development Kit for Multicore Acceleration Version 3.0 Accelerated Library Framework for Hybrid-x86 Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8406-00 Software Development Kit

More information

Cache introduction. April 16, Howard Huang 1

Cache introduction. April 16, Howard Huang 1 Cache introduction We ve already seen how to make a fast processor. How can we supply the CPU with enough data to keep it busy? The rest of CS232 focuses on memory and input/output issues, which are frequently

More information

Hardware Implementation of GA.

Hardware Implementation of GA. Chapter 6 Hardware Implementation of GA Matti Tommiska and Jarkko Vuori Helsinki University of Technology Otakaari 5A, FIN-02150 ESPOO, Finland E-mail: Matti.Tommiska@hut.fi, Jarkko.Vuori@hut.fi Abstract.

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information

Shared Memory vs. Message Passing: the COMOPS Benchmark Experiment

Shared Memory vs. Message Passing: the COMOPS Benchmark Experiment Shared Memory vs. Message Passing: the COMOPS Benchmark Experiment Yong Luo Scientific Computing Group CIC-19 Los Alamos National Laboratory Los Alamos, NM 87545, U.S.A. Email: yongl@lanl.gov, Fax: (505)

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected

More information

InfiniBand SDR, DDR, and QDR Technology Guide

InfiniBand SDR, DDR, and QDR Technology Guide White Paper InfiniBand SDR, DDR, and QDR Technology Guide The InfiniBand standard supports single, double, and quadruple data rate that enables an InfiniBand link to transmit more data. This paper discusses

More information

DESIGN AND IMPLEMENTATION OF AN AVIONICS FULL DUPLEX ETHERNET (A664) DATA ACQUISITION SYSTEM

DESIGN AND IMPLEMENTATION OF AN AVIONICS FULL DUPLEX ETHERNET (A664) DATA ACQUISITION SYSTEM DESIGN AND IMPLEMENTATION OF AN AVIONICS FULL DUPLEX ETHERNET (A664) DATA ACQUISITION SYSTEM Alberto Perez, Technical Manager, Test & Integration John Hildin, Director of Network s John Roach, Vice President

More information

RTI Performance on Shared Memory and Message Passing Architectures

RTI Performance on Shared Memory and Message Passing Architectures RTI Performance on Shared Memory and Message Passing Architectures Steve L. Ferenci Richard Fujimoto, PhD College Of Computing Georgia Institute of Technology Atlanta, GA 3332-28 {ferenci,fujimoto}@cc.gatech.edu

More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

The Avalanche Myrinet Simulation Package. University of Utah, Salt Lake City, UT Abstract

The Avalanche Myrinet Simulation Package. University of Utah, Salt Lake City, UT Abstract The Avalanche Myrinet Simulation Package User Manual for V. Chen-Chi Kuo, John B. Carter fchenchi, retracg@cs.utah.edu WWW: http://www.cs.utah.edu/projects/avalanche UUCS-96- Department of Computer Science

More information