Network Interface Active Messages for Low Overhead Communication on SMP PC Clusters

Motohiko Matsuda, Yoshio Tanaka, Kazuto Kubota and Mitsuhisa Sato
Real World Computing Partnership
Tsukuba Mitsui Building 16F, Takezono, Tsukuba, Ibaraki, Japan

Abstract. NICAM is a communication layer for SMP PC clusters connected via Myrinet, designed to reduce overhead and latency by directly utilizing the micro-processor on the network interface. It adopts remote memory operations to eliminate much of the overhead found in message passing. NICAM employs an Active Messages framework for flexibility in programming the network interface, and this flexibility compensates for the large latency resulting from the relatively slow micro-processor. Running message handlers directly on the network interface reduces overhead by freeing the main processors from polling for incoming messages. The handlers also make synchronization faster by avoiding costly interactions between the main processors and the network interface. In addition, this implementation can completely hide the latency of barriers in data-parallel programs, because handlers running in the background of the main processors allow barriers to be repositioned to places where their latency is not critical.

1 Introduction

Symmetric multiprocessor PCs (SMP PCs) have recently attracted widespread attention, and clusters of SMP PCs with fast networks have emerged as important platforms for high performance computing. The bus bandwidth, however, often limits computation performance: SMP PCs quickly reveal a bottleneck when multiple processors access the bus simultaneously. Even worse, the network interfaces used for clustering further burden the bus. We therefore designed a communication layer, NICAM, which reduces communication overhead by utilizing the micro-processor on the network interface. Overhead reduction is important because overhead, that is, the involvement of the main processors in communication, wastes bus bandwidth as well as the processing power of the processors. In addition, the common technique of overlapping computation and communication tends to make the communication grain-size finer, which results in a larger total cost of overhead. As researchers have shown [1], while latency reduction in data transfer is not very relevant to performance, overhead reduction directly affects the utilization of processing power. Thus, overhead reduction by the network interface is fruitful even though it incurs larger latency due to the relatively slow micro-processor on the network interface.

NICAM also reduces latency in synchronization primitives. While latency is not the first issue in data transfer, latency is the only issue in synchronization. Direct handling of messages by the network interface not only frees the processors from polling overhead, but also eliminates the costly interaction between the processors and the network interface.

NICAM employs an Active Messages framework [2] for flexibility in programming the network interface. Active Messages provide extensibility through simple additions of new handlers, because handlers run almost independently of one another. This flexibility allows data transfer primitives to be combined with synchronization, which compensates for the large latency.

This paper reports on the design of NICAM and its basic performance. We present our PC cluster platform in Section 2 and the NICAM primitives in Section 3. Then, we present a technique of latency hiding and two sets of experimental results in Section 4. We briefly discuss related work in Section 5 and conclude in Section 6.

2 Background

2.1 SMP PC Cluster Platform

Our research platform, COMPaS [3], is a PC-based SMP cluster consisting of eight server-type PCs. Each node contains four Pentium Pros (200 MHz, 512 KB L2 cache, 450GX chip-set, 256 MB main memory) and a Myrinet network interface card. The nodes are connected by a single Myrinet switch. The operating system is Solaris 2.5. This section presents the performance characteristics of a node which guided the design of the communication layer.

Table 1. Aggregate bandwidth of the memory bus on an SMP PC: read, write, and copy bandwidth (MB/s) for varying numbers of threads.

Table 1 shows the memory bus bandwidth for read, write, and copy when multiple threads execute the operations simultaneously. Each figure shows the aggregate bandwidth, that is, the sum of the bandwidth measured on each processor. Notice that the aggregate bandwidth is almost independent of the number of threads. This means that a single processor can consume all the bandwidth available on the memory bus for these simple operations.

Table 2. Barrier synchronization time (μsec) for threads in a node, on the SMP PC and on a Sun SMP.

Table 2 shows barrier performance among the threads in a node, to compare cache coherency performance. The column Sun SMP shows the results on a Sun Enterprise 4000 for comparison. The algorithm of this barrier uses write operations to a location dedicated to each processor, and detects the condition that all locations have been written. A single processor checks the condition and notifies the others. This algorithm does not need any atomic operations and is regarded as reflecting the cache coherency performance. The result shows that the SMP PCs have a good cache coherency implementation, comparable to a more sophisticated SMP machine.
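For concreteness, the following C sketch illustrates the flag-based intra-node barrier described above. The thread count, variable names, and spin loops are our own illustration rather than the code actually measured, and compiler and memory fences are omitted for brevity.

```c
#define NPROC 4

static volatile int arrived[NPROC];   /* one dedicated location per processor        */
static volatile int generation;       /* advanced by processor 0 to release the rest */

void node_barrier(int my_id)
{
    int gen = generation;
    arrived[my_id] = 1;                    /* announce arrival with a plain write      */
    if (my_id == 0) {
        for (int i = 1; i < NPROC; i++)    /* detect that every location is written    */
            while (!arrived[i])
                ;
        for (int i = 0; i < NPROC; i++)
            arrived[i] = 0;                /* reset for the next barrier               */
        generation = gen + 1;              /* notify the other processors              */
    } else {
        while (generation == gen)          /* wait for the notification                */
            ;
    }
}
```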

2.2 Myrinet

Myrinet is a Gigabit LAN system from Myricom Inc. [4]. It consists of communication links, switches, and host interfaces. The communication link is a bi-directional 8-bit data path whose speed is 160 MB/s in each direction. The switch is an 8-by-8 crossbar and uses cut-through routing. Switches can be cascaded in an arbitrary topology, and the route is statically specified in the header of a packet. The host interface is implemented as an I/O adaptor and performs data transfers from/to the host memory system using DMA (Direct Memory Access). Each Myrinet board contains three DMA engines, dedicated to the host memory, the transmitter, and the receiver, respectively, and a micro-processor to control these DMA engines. The on-board SRAM is used both to buffer messages and to store the program of the micro-processor. The following features guided the design:

- On-board micro-processor: the micro-processor is a 32-bit custom RISC CPU core with a general purpose instruction set.
- Reliable link hardware: links are reliable enough to be assumed error-free.
- Unrestricted DMA capability: the DMA engine is capable of accessing the whole physical memory of the host PC.
- Large SRAM size: the SRAM size is up to 1 MB, depending on the board type.

3 NICAM Primitives

3.1 Communication Layer Design

NICAM provides primitives for remote memory operations and synchronization. We based the communication primitives on remote memory operations because they are preferable to message passing with respect to overhead. Message passing suffers from flow-control and buffer-management tasks for handling incoming messages, and sometimes requires copying messages, which sacrifices bus bandwidth. In addition, message passing may need mutual exclusion to coordinate the processors in an SMP node.

In NICAM, all events to the main processors, such as the completion of a data transfer or a barrier, are notified via a flag in main memory, because this greatly reduces the overhead on the main processor. Some primitives take a flag argument which points to a memory location. The flag is set by the network interface when a condition is satisfied.
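As a minimal illustration of this flag convention (the flag type, the busy-wait loop, and the helper name below are our own assumptions, not part of the NICAM interface beyond what Figure 1 lists):

```c
/* Hypothetical sketch: wait until the network interface sets a completion
 * flag in main memory. The main processor polls only its own memory and
 * never interacts with the interface while waiting. */
typedef volatile int nicam_flag_t;

static void wait_flag(nicam_flag_t *flag)
{
    while (*flag == 0)
        ;            /* spin until the interface writes the flag */
    *flag = 0;       /* reset so the flag can be reused */
}
```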

Initialization
    nicam_init()
    nicam_lock_memory(addr, range)
Simple Data Transfer
    nicam_bcopy(src_node, src_addr, dst_node, dst_addr, size)
    nicam_sync(flag_addr)
    nicam_write1(dst_node, flag_addr, val)
Data Transfer with Synchronization
    nicam_bcopy_notify(src_node, src_addr, dst_node, dst_addr, size, flag_addr, onoff)
    nicam_set_counter(flag_addr, count)
    nicam_bcopy_countup(src_node, src_addr, dst_node, dst_addr, size)
Broadcast
    nicam_bcast(src_node, src_addr, dst_addr, size, flag_addr, onoff)
    nicam_bcast_discard(onoff)
Barrier Synchronization
    nicam_barrier(flag_addr)
Message Passing Support
    nicam_bcopy_src(key, dst_node, src_addr, size, flag_addr, onoff)
    nicam_bcopy_dst(key, src_node, dst_addr, size, flag_addr, onoff)

Fig. 1. NICAM primitives.

NICAM is designed for a single job and makes exclusive use of the resources for communication. This is not a problem in practice, since we assume a dedicated environment for a parallel job in order to investigate the utilization of the resources in a cluster. In addition, NICAM requires that the remotely accessed regions of memory be pinned down in advance of remote memory operations. The pin-down operation protects the region of memory from the paging system in a virtual memory environment. Some other systems take alternative approaches and use only limited pinned-down areas [5].

3.2 Remote Memory Operations

Figure 1 lists the set of primitive operations supported by NICAM. Remote memory operations have an interface similar to the local copy (bcopy) operation, but take additional arguments to specify the source and destination nodes. If the source is specified as a node other than the local one, the operation acts as a remote read; an Active Messages mechanism forwards the request to the specified source node. nicam_sync is used by the invoking node to learn of the completion of the currently issued copies.

A variant of the copy operation, nicam_bcopy_notify, provides a data transfer primitive combined with point-to-point synchronization. It notifies the destination node of the completion of a copy by setting a flag. This is useful for programs written in a message passing style. While NICAM does not directly support message passing, many message passing programs can be rewritten using this primitive. It is also useful as an optimization to reduce the number of synchronizations in data-parallel programs, where it is assumed that completions of data transfers are notified to the destination [6].
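As an illustration of how a message-passing style exchange can be rewritten with these primitives, the sketch below lets node 0 push a buffer into node 1's memory and signal arrival through a flag. It is only a sketch: the header name nicam.h, the argument types, the SPMD assumption that both nodes lay out their buffers at the same virtual addresses, and the meaning of the final onoff argument (taken here to enable the notification) are our assumptions, not part of the published interface beyond Figure 1.

```c
#include "nicam.h"             /* hypothetical header declaring the Fig. 1 primitives */

#define SRC_NODE 0
#define DST_NODE 1

static double send_buf[1024];          /* filled on node 0                      */
static double recv_buf[1024];          /* written remotely, read on node 1      */
static volatile int arrived;           /* set by node 1's network interface     */

void exchange(int my_node)
{
    nicam_init();
    /* Remotely accessed regions must be pinned down in advance. */
    nicam_lock_memory(send_buf, sizeof(send_buf));
    nicam_lock_memory(recv_buf, sizeof(recv_buf));
    nicam_lock_memory((void *)&arrived, sizeof(arrived));

    if (my_node == SRC_NODE) {
        /* "Send": copy into node 1's recv_buf; node 1's interface sets the
         * flag at &arrived (same virtual address on node 1) on completion. */
        nicam_bcopy_notify(SRC_NODE, send_buf, DST_NODE, recv_buf,
                           sizeof(send_buf), (void *)&arrived, 1);
    } else if (my_node == DST_NODE) {
        /* "Receive": the main processor only polls a flag in its own memory. */
        while (!arrived)
            ;
        arrived = 0;
    }
}
```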

Another variant, nicam_bcopy_countup, provides a counted completion. Sometimes synchronization points are determined by the number of data items exchanged. For example, in all-to-all communication, the synchronization point is reached when the count reaches the number of nodes involved. This completely avoids explicit synchronizations. nicam_set_counter specifies the count and the flag address used to signal completion.

3.3 Barrier Synchronization

The barrier primitive also signals completion by setting a flag in memory. This not only lowers the notification overhead, but also makes it a fuzzy barrier [7]. In addition, its implementation uses a relaxed algorithm which allows concurrent progress of multiple instances of barriers. That is, barriers can be invoked multiple times before the completion of a previously issued one. This feature tolerates latency and is useful when the environment has a large variance of loads, or when the scale of the cluster is large.

Fig. 2. Steps in a multi-stage barrier performed between the network interfaces (NI0-NI3). The main processor issues a barrier request; each stage performs a 2-way join between interfaces, and the final stage updates a flag in main memory.

Figure 2 shows the steps in a barrier execution. The barrier is implemented with a multi-stage, log(P)-step algorithm; each stage performs a 2-way join and forwards a message to the next stage. At the i-th stage, node n performs a join with its partner at distance 2^i in the node numbering. Small queues are placed at the joins to accommodate multiple instances of barriers. Avoiding the involvement of the main processors is beneficial here, because the algorithm needs multiple stages. Relaxing barriers would be worthless if their execution required polling by the processors.
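The fragment below sketches how the counted-completion and barrier primitives might be used for an all-to-all exchange followed by a split-phase barrier. The data layout, the assumption that the counter is armed on the receiving node before any copy arrives, the flag handling, and the header name nicam.h are ours; only the primitive names come from Figure 1.

```c
#include "nicam.h"                     /* hypothetical header for the Fig. 1 primitives */

#define NNODES 8
#define CHUNK  256

static double out[NNODES][CHUNK];      /* out[n] is destined for node n              */
static double in[NNODES][CHUNK];       /* in[p] receives the chunk sent by node p    */
static volatile int all_arrived;       /* set when NNODES incoming copies have landed */
static volatile int barrier_done;      /* set when the barrier completes              */

void all_to_all_then_barrier(int me)
{
    /* Arm the counter: the local flag is set once NNODES copies
     * (one per peer, including the self-copy) have been counted. */
    nicam_set_counter((void *)&all_arrived, NNODES);

    /* Scatter our chunks; no explicit synchronization is issued. */
    for (int n = 0; n < NNODES; n++)
        nicam_bcopy_countup(me, out[n], n, in[me], sizeof(out[n]));

    while (!all_arrived)               /* completion is known from the count alone */
        ;
    all_arrived = 0;

    /* Split-phase (fuzzy) barrier: issue it, keep computing, check later.
     * The joins proceed on the network interfaces in the background. */
    nicam_barrier((void *)&barrier_done);
    /* ... latency-tolerant local work can be placed here ... */
    while (!barrier_done)
        ;
    barrier_done = 0;
}
```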

3.4 Implementation

Incoming requests of Active Messages are handled by simple polling on the network interface. There are three sources of requests: the main processors and the two DMA engines of the receiver and the host memory, respectively. Active Messages are also used for local requests from the main processors, which simplifies the implementation by handling local and remote requests in the same way.

Hand-shaking is used between the main processors and the network interface to manage requests. A main processor makes a request by writing a flag in the SRAM on the network interface, and the network interface acknowledges it by writing a flag in main memory. In order to avoid mutual exclusion among the main processors, there are multiple pairs of flags, one for each processor.

NICAM passes only virtual addresses between nodes; they are translated to physical addresses by the network interface, which maintains its own copy of the address translation table in the SRAM. The SRAM on the network interface is large enough to hold the whole table for the entire physical memory.

3.5 Basic Performance

Fig. 3. Basic performance of NICAM: bandwidth (throughput, MB/s) and one-way latency (μsec) of remote memory data transfer versus message size.

Figure 3 shows the bandwidth (throughput) and the latency of nicam_bcopy, for sizes up to 16 KB for bandwidth and 4 KB for latency. The maximum bandwidth is about 105 MB/s, observed at 64 KB, and the start-up parameter N1/2 (the message size at which half the peak bandwidth is reached) is a little below 2 KB. The minimum latency for small messages is about 20 μsec (one-way). Since NICAM uses pinned-down areas and does not need any buffering or flow-control, these figures show the actual performance observable from applications.

Figure 4 compares the barrier synchronization time of NICAM and the message passing library PM [5]. We implemented a barrier on top of PM, performed by the main processor using message passing; it uses essentially the same algorithm as the one used in NICAM.

Fig. 4. Barrier synchronization time between nodes using NICAM and the message passing library PM.

The figure shows that, while NICAM is slower than PM for two nodes, it becomes faster as the number of nodes increases. This is because the cost of interaction between the main processor and the network interface is larger in NICAM, but the interaction occurs only once in NICAM.

The call overhead of most NICAM operations is about 5.7 μsec. The breakdown of this overhead is: (1) argument checks, (2) copying a handler address and its arguments into the SRAM on the network interface, and (3) a one-word write to a flag in the SRAM to start processing. Serializing instructions (the CPUID instruction) are inserted after steps (2) and (3) to flush the write buffer. Although the serializing instruction has a very large cost, it is necessary because the write buffer of the Pentium Pro reorders write requests.

4 Hiding Latency and Experimental Results

4.1 Hiding Latency of Barriers in an Array Class

The main target application of NICAM is a scientific computing library in C. The array class library may be considered a variant of the BSP (Bulk Synchronous Parallel) model. However, while synchronization points are explicitly specified in programs under the BSP model, they are implicit in the array class, and it is necessary to synchronize at each expression or statement. Some researchers have succeeded in avoiding explicit barriers at the end of super-steps in a message passing environment [8]. However, synchronization may still be required at the beginning of a super-step when the communication is based on remote writes, because of the possibility of overwriting memory contents on which local processing is still in progress. Since the library has no knowledge of the use of arrays, it would have to lock-step strictly in each super-step, and this would make the library sensitive to latency.

The barriers at the beginning of super-steps can be eliminated as described in [6], where a new area is always allocated to store the results. The remote writes are then directed to the newly allocated locations and never overwrite the existing ones. However, incorporating this technique requires modification of the storage reclaimer, because the storage cannot be freed locally. That is, the storage could be overwritten if it were reclaimed and then reallocated independently while other nodes still held a reference to it.

The storage should be reclaimed only when all nodes are ready to release it. This condition can be checked by the use of barriers. The reclaimer first signals a barrier on a location specific to the freed storage, and then checks the completion of the barrier later. Completion indicates that all nodes are ready to release the storage. The implementation of the barrier has the preferable properties of background execution and concurrent progress of multiple instances. Background execution is important because the barriers are triggered by storage reclamation and it is hard to find insertion points for polling; performance would otherwise degrade if polling by the main processors were scattered through every operation. Concurrent progress is also important because the variance in the issuing of barriers is large and the barriers may overlap.

4.2 Overlapping Communication and Computation

Overlapping of computation and communication is a common technique for exploiting hybrid distributed/shared-memory programming on SMP clusters [3]. The communication overhead must be small enough for the overlapping to be effective. The effect of overlapping is presented for an explicit Laplace equation solver using the Jacobi method. In each iteration, a new array is computed from the old array by averaging the four neighbors in a two-dimensional space. The array is partitioned into equally sized strips which are distributed among the nodes. Although the values of the boundary elements need to be exchanged between nodes, computation can start without waiting for the completion of the exchanges, because the internal elements have no dependency on the exchanged ones.

Fig. 5. Difference in speed-up between Laplace solvers with and without overlapping.

Fig. 6. The effect of latency hiding in the cshift operation, for a two-dimensional array of size N x N.

Figure 5 shows the results of the experiments when the number of threads per node is varied over 1, 2, and 4. The number of nodes is fixed to eight. The result shows the speed-up relative to a single processor for a fixed array size. The speed-up is not good because of the large bus traffic in this application. However, the gain from overlapping becomes larger, even though the number of communications increases in proportion to the number of threads.
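The core of the overlapped solver can be sketched as follows. The strip layout, neighbor numbering, reuse of a single flag per neighbor, and the header name nicam.h are our own assumptions (a production code would need per-iteration flags and handling of the edge strips); the point is only that the interior update runs while the boundary rows are in flight.

```c
#include "nicam.h"    /* hypothetical header for the Fig. 1 primitives */
#include <string.h>

#define N    1024                   /* columns per row                        */
#define ROWS 128                    /* rows owned by this strip               */
static double u[ROWS + 2][N];       /* rows 0 and ROWS+1 are halo rows        */
static double u_new[ROWS + 2][N];
static volatile int upper_ready, lower_ready;   /* flags set by the local NI  */

void jacobi_step(int me, int up, int down)      /* up/down: neighbor strips   */
{
    /* Push our boundary rows into the neighbors' halo rows; the remote
     * interface sets the corresponding flag when the row has arrived. */
    nicam_bcopy_notify(me, u[1],    up,   u[ROWS + 1], sizeof(u[1]),
                       (void *)&lower_ready, 1);
    nicam_bcopy_notify(me, u[ROWS], down, u[0],        sizeof(u[ROWS]),
                       (void *)&upper_ready, 1);

    /* Interior rows do not depend on the exchanged data: compute them now. */
    for (int i = 2; i < ROWS; i++)
        for (int j = 1; j < N - 1; j++)
            u_new[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]);

    /* Wait for the halo rows, then update the two boundary rows. */
    while (!upper_ready || !lower_ready)
        ;
    upper_ready = lower_ready = 0;
    for (int j = 1; j < N - 1; j++) {
        u_new[1][j]    = 0.25 * (u[0][j] + u[2][j] + u[1][j-1] + u[1][j+1]);
        u_new[ROWS][j] = 0.25 * (u[ROWS-1][j] + u[ROWS+1][j]
                                 + u[ROWS][j-1] + u[ROWS][j+1]);
    }
    memcpy(u, u_new, sizeof(u));    /* adopt the new iterate (simplified) */
}
```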

4.3 Effect of Hiding Latency

The effect of hiding the latency of barrier synchronization is shown using two versions of a cshift (cyclic shift) operation. One is a bulk-synchronous version which uses a barrier at the beginning of the operation; the other is a latency-hiding version which uses a barrier in the storage reclaimer. Figure 6 shows the effect of the latency hiding. All eight nodes are used, but only a single processor per node. The shift size is one. The figure shows that the whole barrier cost is hidden. Slightly more gain is achieved through the relaxation of the synchronization condition at each step, because no explicit barriers appear in the code.

5 Related Work

Schauser et al. reported on experiments running Active Messages on a network interface [9]. Krishnamurthy et al. reported on running the Active Messages handlers for Split-C primitives on various network interfaces [10]. They reported that low latency is achieved on platforms such as the Paragon and the Berkeley NOW. NICAM, however, further exploits the benefits of utilizing the network interface.

There are a number of fast message passing layers for the Myrinet network, such as BIP [11], FM [12], and PM [5]. These are concerned mainly with bandwidth and latency. In contrast, NICAM aims at reducing overhead, and it shows some disadvantage in bandwidth and latency because it runs much more of the work on the relatively slow micro-processor.

6 Conclusion

NICAM makes use of the micro-processor on the network interface to reduce the overhead of data transfer and the latency of synchronization. In addition, background execution of barriers can completely hide their latency. NICAM employs an Active Messages framework for flexibility in programming the network interface, which allows easy integration of new primitives, such as those that help rewrite message passing programs with remote memory operations.

While NICAM is based on remote memory operations, this is not considered restrictive for data-parallel programming. One-sided communication is often more suitable than message passing for implementing data-parallel operations. Many data-parallel operations, including shift, scan, and exchange, are implemented straightforwardly with remote memory operations. In contrast, it is sometimes necessary in message passing to swap the order of code to make a sender-receiver pair, or to use asynchronous operations to avoid pairing and to tolerate latency.

While many distributed memory MPP machines provide remote memory operations and fast barrier synchronization in hardware, NICAM attempts to provide these operations using commodity network hardware. In NICAM, the data transfer primitives can optionally be combined with synchronization actions, and the synchronization primitives take latency hiding into account, which compensates for the relatively large latency of a software implementation.

References

1. R. P. Martin, A. M. Vahdat, D. E. Culler, T. E. Anderson: Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture. Int'l Symp. on Computer Architecture (ISCA'97) (1997).
2. T. von Eicken, D. E. Culler, S. C. Goldstein, K. E. Schauser: Active Messages: a Mechanism for Integrated Communication and Computation. Int'l Symp. on Computer Architecture (ISCA'92), pp. 256-266 (1992).
3. Y. Tanaka, M. Matsuda, M. Ando, K. Kubota, M. Sato: COMPaS: A Pentium Pro PC-based SMP Cluster and its Experience. IPPS Workshop on Personal Computer based Networks of Workstations (PC-NOW'98) (1998).
4. N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, W.-K. Su: Myrinet: A Gigabit-per-Second Local-Area Network. IEEE Micro, Vol. 15, No. 1, pp. 29-36 (1995).
5. H. Tezuka, A. Hori, Y. Ishikawa, M. Sato: PM: An Operating System Coordinated High Performance Communication Library. High-Performance Computing and Networking, LNCS 1225, pp. 708-717, Springer-Verlag (1997).
6. M. Gupta, E. Schonberg: Static Analysis to Reduce Synchronization Costs in Data-Parallel Programs. Symp. on Principles of Programming Languages (POPL'96), pp. 322-332 (1996).
7. R. Gupta: The Fuzzy Barrier: A Mechanism for High Speed Synchronization of Processors. Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-III), pp. 54-63 (1989).
8. A. Fahmy, A. Heddaya: Communicable Memory and Lazy Barriers for Bulk Synchronous Parallelism in BSPk. Boston University Technical Report BU-CS (1996).
9. K. E. Schauser, C. J. Scheiman, J. M. Ferguson, P. Z. Kolano: Exploiting the Capability of Communications Co-processors. Int'l Parallel Processing Symposium (IPPS'96) (1996).
10. A. Krishnamurthy, K. E. Schauser, C. J. Scheiman, R. Y. Wang, D. E. Culler, K. Yelick: Evaluation of Architectural Support for Global Address-Based Communication in Large-Scale Parallel Machines. Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII) (1996).
11. L. Prylli, B. Tourancheau: BIP: a New Protocol Designed for High Performance Networking on Myrinet. IPPS Workshop on Personal Computer based Networks of Workstations (PC-NOW'98) (1998).
12. S. Pakin, M. Lauria, A. Chien: High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet. Supercomputing'95 (1995).

This article was processed using the LaTeX macro package with LLNCS style.
