The Lightweight Protocol CLIC on Gigabit Ethernet


Díaz, A.F.; Ortega, J.; Cañas, A.; Fernández, F.J.; Anguita, M.; Prieto, A.
Departamento de Arquitectura y Tecnología de Computadores
University of Granada (Spain)
{afdiaz,julio,acanas,jfernand,manguita}@atc.ugr.es, aprieto@ugr.es

Abstract

In Gigabit class networks, the physical transmission time is small compared to the time required to process the TCP/IP protocol stack. Thus, the usefulness of lightweight protocols that reduce the communication software overhead is even higher, as the performance demands shift to the network interface hardware and/or software. The CLIC communication protocol has recently been proposed for efficient communication in clusters using the Linux operating system. Its approach to optimizing communication performance in a cluster of computers differs from that of providing a user-level interface that removes the operating system (OS) mediation from the communication path. Thus, besides reducing latencies and increasing bandwidth figures even for short messages, CLIC meets other requirements such as multiprogramming, portability, protection against corrupted programs, reliable message delivery, direct access to the network for all applications, etc., by providing optimized OS support for reliable and efficient network software that avoids the TCP/IP protocol stack. In this paper we show how new facilities and the improved performance of network interface cards (NICs) have been incorporated into CLIC in order to take advantage of gigabit network technology.

1. Introduction

The use of clusters of personal computers or workstations as high-performance parallel platforms poses problems related to communication between processors, which may limit their use to coarse-grain applications. Although network technology is moving towards gigabits and tens of gigabits per second, it is difficult for applications to take advantage of this improvement. The gap between the user communication requirements of in-order and reliable message delivery and deadlock safety, and network features such as arbitrary delivery order, limited fault-handling, and finite buffering capabilities, makes it necessary to have layers of protocols that provide the communication services not supported by the network hardware but required by the applications [1]. In this way, protocols such as TCP/IP cause an overhead that represents an important part of the communication cost. This inefficiency is even worse in gigabit networks, such as Myrinet and Gigabit Ethernet, where the physical transmission time is negligible compared to the time spent processing the communication protocols [2]: while processors reach gigahertz speeds and networks provide gigabit bandwidths, the I/O buses (usually PCI buses) have become the bottleneck in the communication paths.

The main approaches adopted to reduce this software overhead have been the improvement of the TCP/IP layers [3,4] and the substitution of the TCP/IP layers by alternative ones [4-6]. With respect to this latter approach, two alternatives can also be considered: communication layers with efficient OS support [6,12], and user-level communication interfaces [7-10]. This last alternative attempts to remove the OS mediation from the communication path to avoid system calls for communication, and to provide a closer interaction between the parallel application and the NIC.
Recently, several companies, such as Intel, IBM, Microsoft, Tandem and Compaq, defined the Virtual Interface Architecture (VIA) [16], a set of user-level interface specifications with the goal of being recognized as a de facto standard for communication in clusters. VIA incorporates influences from previously proposed user-level interfaces, such as U-Net, AM-II, and VMMC [16]. Instead, other communication protocols such as GAMMA [2,6] and CLIC [12,13] substitute the TCP/IP layers and improve the communication performance with efficient OS support for communication.

Section 2 describes the new situation created by gigabit network technology and some improvements in the network interface cards (NICs) to take advantage of this gigabit technology. Section 3 describes the main characteristics of the proposed CLIC protocol, along with the modifications that have been included to scale its performance to gigabit networks, more specifically to Gigabit Ethernet. In this section, we also compare some characteristics of CLIC with those of VIA and GAMMA. Then, Section 4 gives details about the experimental results corresponding to the performance evaluation of CLIC. Finally, Section 5 states the conclusions of the paper.

2. Gigabit networks and network interfaces

When communication is done through a gigabit class network, the bottleneck in the communication path can shift from the network to other parts of the system, as we show in what follows.

Figure 1. Different paths to transfer data to the NIC (user buffer, kernel output buffer, main memory, north bridge, PCI bus, network interface, NIC output buffer)

Figure 1 describes four possible paths to transfer the data to be sent from the user memory to the network interface buffer. From this buffer, the data are sent to the receiver through the network. In path 1, data are copied directly from user memory to the network interface. In path 2, data are first copied into the output buffer of the network card and then copied to the network interface by the NIC's own processor; thus two copies are required in this case. Path 3 also requires two copies: the first copy, made by the host processor, moves the data from the user buffer to a kernel buffer; the second copy transfers the data directly from the kernel buffer to the network interface. Finally, path 4 is similar to path 3, although one more copy is done, as the data are first copied to the network output buffer and then transferred to the network interface.

The first version of CLIC [13], implemented on Fast Ethernet, uses path 4, as one of the design requirements we considered was that the drivers of the NICs could not be modified. This is an important difference between CLIC and other lightweight protocols such as GAMMA, and it makes it possible to provide features related to portability, protection, multiprogramming, reliable message delivery, etc. Nevertheless, as NICs (and their corresponding drivers) incorporate new features, it has been possible to improve CLIC by implementing 0-copy and other characteristics described in Section 3. Thus, the version of CLIC for Gigabit Ethernet transfers the data through path 2.

The extension of CLIC to gigabit class networks poses new situations and interesting problems that require new strategies to optimize the communication protocol. Thus, although the data copies are an important element of the communication overhead, there are other factors, such as interrupt processing, that constitute important bottlenecks in the communication path. These factors reduce the time the processor has to process the applications. For example, if we use the standard MTU (Maximum Transmission Unit) in Ethernet, i.e. about 1500 bytes, a Gigabit Ethernet NIC will produce approximately one interrupt every 12 microseconds (1500 bytes × 8 bits/byte × 1 ns/bit). Although the time to process an interrupt changes according to the system characteristics, if we consider that the PCI 2.1 specification (33 MHz) allows for delays on the order of microseconds, even with a very optimized use of the OS it would be very difficult to cope with such an interrupt rate. Moreover, not only does the interrupt rate increase, but so does the number of TCP/IP headers to process through the protocol stack.
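As a quick sanity check of these figures, the following standalone C program (illustrative, not from the paper) reproduces the per-frame arithmetic for the standard MTU and for the 9000-byte Jumbo frames discussed in the next section:

```c
/* Back-of-the-envelope check of the interrupt rates quoted in the text:
 * at 1 Gb/s, a frame of MTU bytes arrives roughly every MTU*8 ns.
 * All figures are illustrative, not measurements from the paper. */
#include <stdio.h>

int main(void)
{
    const double link_gbps = 1.0;       /* Gigabit Ethernet */
    const int mtus[] = { 1500, 9000 };  /* standard and Jumbo frames */

    for (int i = 0; i < 2; i++) {
        /* time per frame in microseconds: bits / (bits per microsecond) */
        double us = mtus[i] * 8.0 / (link_gbps * 1000.0);
        printf("MTU %5d bytes -> one frame (potentially one interrupt) "
               "every %.1f us, i.e. ~%.0f interrupts/s\n",
               mtus[i], us, 1e6 / us);
    }
    return 0;
}
```

For the 1500-byte MTU this yields the 12 µs interval quoted above; for 9000 bytes the interval grows to 72 µs, the factor-of-six increase mentioned below.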
In Fast Ethernet, with bandwidths of 100 Mb/s, it is possible to reach 90% of the maximum bandwidth with a 15-20% CPU utilization. Achieving a similar situation in networks with 1 Gb/s bandwidths would require almost 100% of the processor power [11]. Taking into account the usual memory bus bandwidths, it would seem that the influence of the data copies on the overall communication bandwidth is not significant. Nevertheless, a copy uses system resources such as the memory and PCI buses and the processor, thus affecting the global performance of the system and the applications. So it is important to decrease the number of data copies, and many NICs for Gigabit Ethernet incorporate features allowing a direct data transfer between user memory and the NIC buffers. By using the corresponding call to the driver, the message layer can take advantage of these new features.

Besides the reduction of copies in the communication critical path, other alternatives that can improve the performance of the network interface are Jumbo frames, coalesced interrupts, and fragmentation at source and destination [11]. Jumbo frames allow the use of MTUs longer than the Ethernet standard of 1500 bytes. Thus, MTUs of up to 9000 bytes can be used to reduce the number of generated interrupts and the overhead associated with communication protocol processing. Most NIC and switch manufacturers provide Jumbo frames. However, these frames affect interoperability (both communicating computers have to use Jumbo frames), and they are not scalable, as they increase the time between frame arrivals only by a factor of six.

With coalesced interrupts, the NIC only interrupts the processor after a given time interval, or after a given number of packets has arrived. Although this technique reduces the number of generated interrupts, it delays the reception of messages. This situation is especially undesirable for small packets. Nevertheless, the drivers of present NICs usually allow the dynamic adjustment of the time intervals used in interrupt coalescing.

Another technique that reduces processor utilization and allows adequate processing of long messages without penalizing short ones is fragmentation. It consists in sending packets to the NIC with sizes larger than the link MTU. The NIC divides the packets according to the MTU size before sending them, and it also assembles received packets to build the packet that has to be delivered to the application. To implement this technique, the NIC should include the corresponding features, and it is necessary to make some slight modifications in the driver and to program the NIC firmware appropriately. The use of fragmentation with the Alteon Acenic 2, which includes two 88 MHz MIPS R4000-like processors and 2 Mbytes of DRAM, is described in [11].

In the present version of CLIC for Gigabit Ethernet, coalesced interrupts, 0-copy, and Jumbo frames have been implemented. As the use of fragmentation would require modifying the driver, it has not been included, in order to preserve the portability of the system. Nevertheless, if portability is not an important issue for a given system, it would be interesting to use fragmentation to improve performance. The implementation of fragmentation as a selectable alternative will be considered in future versions of CLIC. As we will see in the next section, CLIC implements an optimized communication processing by the OS without modifying the drivers. Thus, it has been relatively easy to take advantage of the present trend towards more powerful network and I/O cards. The hardware included in such cards provides more functionality, reducing CPU utilization in the communication path.

3. The CLIC protocol

A detailed description of the first version of CLIC for Fast Ethernet is provided in [12,13], where a complete reference to other previously proposed works on the topic, along with their differences with respect to CLIC, is also given. This section provides a brief description of the main characteristics of a new version of CLIC for Gigabit Ethernet.

Figure 2. Comparison of CLIC and TCP/IP layers (user processes access either the CLIC module or the sockets/TCP/IP stack, both sitting on the Gigabit Ethernet drivers)

3.1 How CLIC works

CLIC is embedded in the Linux kernel and provides an interface to the user applications (Figure 2). The key to the communication improvement provided by CLIC is the reduction in the number of protocol layers, which decreases the software overhead and the number of data copies. For example, the IP layer is not necessary in a cluster of computers where the machines are connected by the same network, making it unnecessary to use IP protocols with routing. In this way, CLIC consists of a reliable transport protocol that interfaces with the NIC and its corresponding driver. When an application executes a send, a system call is generated. For example, if the CPU is a processor with Intel architecture, this system call is implemented with the interrupt INT 80h (Figure 3 shows this case). This system call is labelled in Figure 3 as (1).
The overhead associated with entering and leaving the OS kernel through the system call is approximately 0.65 µs (on a PC running at 1.5 GHz). The generated system call activates the program CLIC_MODULE, which is inserted within the OS kernel. CLIC_MODULE composes the headers, updates the SK_BUFF structure, and calls the driver ((2) in Figure 3). With respect to the headers, among the three existing levels of headers in Ethernet (level 1: pure Ethernet; level 2: LLC; and level 3: IP), the level 1 header is used. This header consists of 14 bytes (6 to indicate the destination, 6 to indicate the origin of the message, and 2 to indicate the type of packet). Then, the CLIC header is added. This header has 12 bytes that indicate whether the packet is an MPI packet, an internal packet, a kernel function packet, etc.
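The header layout just described can be captured in a short C sketch. The Ethernet fields are standard; the CLIC header field names, and the way they split the 12 bytes, are hypothetical, chosen only to match the sizes and purposes stated in the text:

```c
/* Sketch of the packet header layout described above. The Ethernet
 * (level 1) fields are standard; the clic_hdr fields are assumptions
 * matching the stated size (12 bytes carrying the packet type and
 * bookkeeping), not the actual structure used by the implementation. */
#include <stdint.h>
#include <stdio.h>

struct eth_hdr {                 /* level 1: pure Ethernet, 14 bytes */
    uint8_t  dst[6];             /* destination MAC address */
    uint8_t  src[6];             /* origin MAC address */
    uint16_t ethertype;          /* type of packet */
} __attribute__((packed));

struct clic_hdr {                /* 12 bytes; field names are assumptions */
    uint16_t pkt_type;           /* MPI packet, internal, kernel function... */
    uint16_t seq;                /* sequence number for reliable delivery */
    uint32_t len;                /* payload length */
    uint32_t src_process;        /* sender identification */
} __attribute__((packed));

int main(void)
{
    printf("Ethernet header: %zu bytes, CLIC header: %zu bytes\n",
           sizeof(struct eth_hdr), sizeof(struct clic_hdr));
    return 0;
}
```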

Figure 3. Schematic of CLIC inner working (sender and receiver: user application and user memory, CLIC module, driver, system memory, kernel)

The SK_BUFF structure used by the drivers allows a fragmented send, i.e. it is possible to send data that are not allocated in contiguous memory addresses. Thus, SK_BUFF includes the pointers to the headers and to the data to be sent from the user space. The driver initializes the DMA transfer of the data (which will be moved by the NIC acting as a bus master) and finishes by indicating to CLIC_MODULE whether or not it is possible to send the data. If the data can be sent, the NIC moves them from memory to the NIC buffers by using the pointers stored in the SK_BUFF structure ((3) in Figure 3). In this case, CLIC_MODULE and the driver can finish before the data transfer starts, and free the CPU. Thus, the sender overhead is the time to execute CLIC_MODULE and the driver, and it (almost) does not depend on the size of the message. If the data cannot be sent at the present moment, CLIC_MODULE copies the data into system memory. This copy is part of the sender overhead for this message and the CPU consumes time to do it; nevertheless, this time is overlapped with the communication of other messages by the NIC. Later, when the data can be sent, they will be moved from system memory to the NIC by a DMA transfer initialized by the driver, in which the NIC acts as a bus master. Finally, the NIC inserts these packets into the communication network.

In our previous version of CLIC (for Fast Ethernet), a copy is required to move the data to be sent from the user memory to a zone in the system memory where an SK_BUFF structure is implemented. After that, a header is also added to each packet. Then, CLIC_MODULE searches the corresponding OS table for the pointer to the driver and calls it. Then, the driver copies the data into the NIC buffers.

Figure 3 also illustrates message reception with an immediate write of the message into the user memory of the receiver process (remote write). Nevertheless, the following description also deals with the reception of a message when the receiver calls a receive function. When a packet arrives, the NIC of the receiver generates an interrupt (in the case of a PCI bus, the interrupt assigned by this bus), indicated in Figure 3 as (4). This interrupt starts the execution of the driver. The driver routine remains active until all the data stored in the NIC buffers have been moved to system memory. When the packet has been transferred to system memory ((5) in Figure 3), and after checking the type information codified in the last two bytes of the packet header, the bottom halves are checked and, since the corresponding request from the NIC is pending, CLIC_MODULE is called (step (6) in Figure 3) to execute the function corresponding to the type of packet received. First, CLIC_MODULE checks whether there is a process waiting for the corresponding packet. If so, CLIC_MODULE moves the data to the user memory of that process. Otherwise, the packet remains in system memory. When a process calls a receive function for this packet, a lightweight system call is generated (INT 80h in the case of Pentium, as in the send process), indicating the location in user memory to which the data have to be transferred.
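Returning briefly to the sender side, the following userspace sketch models the scatter-gather role that SK_BUFF plays: a descriptor carries pointers to the headers and to the (possibly non-contiguous) user data, and the "NIC" gathers them in order in a single pass. All names are illustrative; the real kernel structures differ:

```c
/* Userspace model of the fragmented (scatter-gather) send that SK_BUFF
 * enables: the descriptor points at headers and user data, so the NIC
 * can gather non-contiguous memory in one DMA transfer. Illustrative only. */
#include <stdio.h>
#include <string.h>

#define MAX_FRAGS 4

struct frag { const void *addr; size_t len; };

struct send_desc {                 /* stands in for the SK_BUFF role */
    struct frag frags[MAX_FRAGS];
    int nr_frags;
};

/* Model of the NIC acting as bus master: gather all fragments in order. */
static size_t nic_dma_gather(const struct send_desc *d, char *wire, size_t cap)
{
    size_t off = 0;
    for (int i = 0; i < d->nr_frags && off + d->frags[i].len <= cap; i++) {
        memcpy(wire + off, d->frags[i].addr, d->frags[i].len);
        off += d->frags[i].len;
    }
    return off;
}

int main(void)
{
    const char eth_hdr[14]  = "ETH-HDR......";  /* 14-byte level 1 header */
    const char clic_hdr[12] = "CLIC-HDR...";    /* 12-byte CLIC header */
    const char user_data[]  = "payload from user memory";
    struct send_desc d = {
        .frags = { { eth_hdr,  sizeof eth_hdr },
                   { clic_hdr, sizeof clic_hdr },
                   { user_data, sizeof user_data } },
        .nr_frags = 3,
    };
    char wire[128];
    printf("gathered %zu bytes onto the wire\n",
           nic_dma_gather(&d, wire, sizeof wire));
    return 0;
}
```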
To receive an asynchronous message (a remote write), CLIC_MODULE directly moves the packet from system memory to the corresponding user memory location without having to wait for any receive call (step (7) in Figure 3). CLIC_MODULE is called from the user process executing either a receive or a send. If the message has not arrived yet, CLIC_MODULE does nothing and returns. Then, control passes to the receiver, which can proceed with the execution of other instructions (if the receive is non-blocking), or remains waiting for the corresponding message. In this case, the OS scheduler proceeds as necessary. With respect to the OS mediation in the communication path through system calls, although this produces an overhead, its magnitude is small (less than one microsecond), and it makes it possible to use all the services provided by the OS. For example, an efficient scheduler remains directly applicable when CLIC is used in realistic (multi-user, multitasking) conditions.
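The receive-side logic just described (deliver immediately if a receiver is already waiting, otherwise park the packet in system memory until a receive is posted) can be summarized in a toy model; the names and the fixed-size queue are assumptions for illustration only:

```c
/* Toy model of receive-side matching: an arriving packet is delivered
 * straight to user memory if a receiver is waiting, otherwise it is
 * parked in "system memory" until a receive is posted. Illustrative. */
#include <stdio.h>
#include <stdbool.h>

#define QLEN 8

static int pending[QLEN];     /* packets parked in system memory */
static int npending;
static bool receiver_waiting;

static void packet_arrived(int pkt)      /* called from the "bottom half" */
{
    if (receiver_waiting) {
        printf("packet %d copied straight to user memory\n", pkt);
        receiver_waiting = false;
    } else if (npending < QLEN) {
        pending[npending++] = pkt;       /* keep it until a receive arrives */
    }
}

static void user_receive(void)           /* the lightweight system call */
{
    if (npending > 0)
        printf("packet %d moved from system to user memory\n",
               pending[--npending]);
    else
        receiver_waiting = true;         /* block (or return, if non-blocking) */
}

int main(void)
{
    packet_arrived(1);   /* no receiver yet: parked in system memory */
    user_receive();      /* consumes packet 1 */
    user_receive();      /* nothing pending: waits */
    packet_arrived(2);   /* delivered directly to the waiting receiver */
    return 0;
}
```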

3.2 Comparison with VIA

Here we give some details of the VIA and GAMMA interfaces, to put CLIC in the context of the research done in this field. VIA [16] introduces the concept of virtual interface (VI). Each process opens a VI to communicate with another process. There are two queues associated with each VI (one for receiving messages and one for sending messages), with message descriptors organized as linked lists. Each descriptor points to one or more buffer descriptors. To send a message, an application adds a new message descriptor at the tail of the send queue. After a message has been sent, an end-of-transmission bit is set in the descriptor of the transmitted message, and the application extracts the descriptor from the queue when it arrives at the head. To receive a message, the application adds, at the end of the receive queue, a descriptor of free buffers where the messages can be written as they arrive. VIA also allows direct transfers between local and remote memories (RDMA, remote DMA). In VIA, the reduction in the communication overhead is based on (1) avoiding the OS participation in multiplexing the communication hardware between the processes; (2) eliminating the copies of the messages between different memory zones; and (3) not using interrupts. As we explain below, CLIC uses different strategies for points (1) and (3):

(a) The OS interaction with the network interface (NI) hardware. Processes in a computer share the hardware, and more specifically the communication hardware. When protocols such as TCP/IP are used, the control of access to the NI hardware is done by software running in kernel mode. Direct access to the interface hardware from applications running in user mode is not allowed; they require a system call to run the corresponding OS routine. The processes also share the memory of the computer. In this case, however, the virtual memory system uses the corresponding page and segment tables, specific hardware to make the required address translations, etc., and avoids software intervention in each memory access (with the corresponding overhead). Applying a similar strategy, VIA defines multiple virtual interfaces that can be used directly by the applications (in user mode) and changes the way communication is considered in the system: instead of being treated as an infrequent transaction between slow devices, it has a status similar to memory accesses. However, VIA does not guarantee reliable communication. Instead, the application (not the communication system) has to take care of reliability. Thus, the situation is similar to that of UDP/IP, although reliable communication software for VIA is more elaborate, since copying data between different memory zones is not allowed [10].

CLIC relies on the OS to access the network interface hardware. It would be possible to decrease the overhead associated with switching between user and system modes by using lightweight calls, as in other communication systems such as GAMMA [2,6,14,15]. In this case, when returning to user mode, the scheduler is not called, and it is possible to save some amount of time. CLIC does not use this type of call because, when there are several requests for pending messages, the intervention of the scheduler makes it possible to attend to these messages faster. Moreover, the amount of time required to switch from user to kernel mode (or vice versa) is about 0.65 µs (less than 2% of the time required to send a message).
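As an aside, the VI send-queue mechanics described at the beginning of this subsection can be modelled compactly; the structure and function names below are illustrative, not taken from the VIA specification:

```c
/* Compact model of a VIA-style send queue: the application appends a
 * descriptor at the tail, the "hardware" marks it done, and completed
 * descriptors are reaped from the head. Names are illustrative. */
#include <stdio.h>
#include <stdbool.h>

struct desc {
    const void *buf;
    size_t len;
    bool done;              /* end-of-transmission bit set on completion */
    struct desc *next;
};

struct vi_queue { struct desc *head, *tail; };

static void post_send(struct vi_queue *q, struct desc *d)
{
    d->done = false;
    d->next = NULL;
    if (q->tail) q->tail->next = d; else q->head = d;
    q->tail = d;
}

static struct desc *reap(struct vi_queue *q)   /* application polls the head */
{
    struct desc *d = q->head;
    if (!d || !d->done) return NULL;           /* head not yet completed */
    q->head = d->next;
    if (!q->head) q->tail = NULL;
    return d;
}

int main(void)
{
    struct vi_queue sq = { 0 };
    struct desc d = { .buf = "hello", .len = 5 };
    post_send(&sq, &d);
    d.done = true;          /* the NIC would set this on completion */
    printf("reaped descriptor of %zu bytes\n", reap(&sq)->len);
    return 0;
}
```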
The improvement introduced by CLIC with respect to the role of the OS in the communication process comes from the reduction in the complexity associated with the TCP/IP protocol suite. As the CLIC communication protocol is simpler, it requires a smaller header, and it is possible to reduce the time required to build the packets and to increase the effective bandwidth.

(b) Interrupts. VIA uses polling to determine whether a message has arrived. Thus, the processor consumes cycles while it waits for messages to be received. If most messages are short, this wait time is not very high and, if they correspond to coordination messages, the processor cannot execute other instructions of the process anyway (if there are more processes waiting for the processor, control could pass to one of them). However, if polling is implemented through accesses to the network interface card using I/O transactions, the time spent can be too high, thus decreasing the effective bandwidth and increasing the delay in the transmission of packets between the network interface card and main memory. So, when polling is used, the polling frequency must be carefully selected in order to keep the corresponding overhead as low as possible. In CLIC, interrupts are used, since we consider that they represent the right way to manage asynchronous events. The interrupt latency represents a significant part of the message latency (about 20 µs). However, it is frequently not necessary to attend to one interrupt per packet, because when the routine that transfers the packets is executed, it moves all the pending packets. Moreover, to reduce the time the CPU has to spend processing communication tasks, CLIC uses coalesced interrupts.
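A minimal sketch of a coalesced-interrupt policy such as the one CLIC relies on, assuming illustrative thresholds (raise one interrupt after either a packet count or a maximum wait is exceeded):

```c
/* Sketch of a coalesced-interrupt policy: raise one interrupt after
 * either max_frames packets have accumulated or max_wait_us has elapsed
 * since the first unserviced packet. Thresholds are illustrative. */
#include <stdio.h>
#include <stdbool.h>

struct coalesce {
    int  max_frames, max_wait_us;
    int  pending;
    long first_arrival_us;      /* timestamp of oldest unserviced packet */
};

static bool on_packet(struct coalesce *c, long now_us)
{
    if (c->pending++ == 0)
        c->first_arrival_us = now_us;
    /* interrupt if enough packets piled up or the oldest waited too long */
    if (c->pending >= c->max_frames ||
        now_us - c->first_arrival_us >= c->max_wait_us) {
        c->pending = 0;
        return true;            /* raise the interrupt; driver drains all */
    }
    return false;
}

int main(void)
{
    struct coalesce c = { .max_frames = 4, .max_wait_us = 100 };
    long t = 0;
    for (int i = 0; i < 10; i++, t += 12)      /* a packet every 12 us */
        if (on_packet(&c, t))
            printf("interrupt at t=%ld us after batching packets\n", t);
    return 0;
}
```

With these (assumed) thresholds, a 1 Gb/s stream of 1500-byte frames triggers one interrupt per four packets instead of one per packet, at the cost of delaying the first packet of each batch, which is exactly the trade-off discussed above.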

4. Experimental results

We have evaluated the effect of 0-copy for different MTU sizes, also considering the use of Jumbo frames (MTU=9000 bytes) against the standard Ethernet MTU of 1500 bytes. The results for CLIC and TCP/IP are shown in Figure 4. The SMC9462TX and 3C996-T Gigabit Ethernet/PCI NICs have been used to obtain these results. The PCI buses of the connected computers are 33 MHz, 32-bit buses. All the experiments use the coalesced interrupts provided by the NICs. As shown in Figure 4, although Jumbo frames and 0-copy both provide better performance, the improvement achieved by Jumbo frames is higher than that obtained by 0-copy. Moreover, the effect of 0-copy is higher with 1500-byte frames than with 9000-byte ones.

Figure 4. Bandwidths with CLIC for different MTUs (9000 and 1500 bytes) and 0- or 1-copy

Figure 5 compares the performance of CLIC and TCP/IP for different values of the MTU. In all cases, 0-copy has been used. As shown, CLIC provides more than a twofold increase over the bandwidths achieved by TCP/IP in the best case for TCP/IP (MTU=9000 bytes). Moreover, CLIC allows the bandwidth versus packet size curves to rise faster than with TCP/IP. The latency for short messages (0 bytes) is 36 microseconds, and 50% of the bandwidth is reached for packets of 4 Kbytes with CLIC, and approximately 16 Kbytes with TCP/IP.

Figure 5. CLIC vs. TCP/IP for different MTUs (9000 and 1500)

Figure 6 compares the performance of CLIC, an implementation of MPI on CLIC (MPI-CLIC), and implementations of MPI and PVM on TCP/IP (denoted MPI and PVM in the figure). As shown, the bandwidths of CLIC and MPI-CLIC are higher than those provided by MPI and PVM on TCP/IP. In the worst case, for long messages, MPI on CLIC provides 1.5 times the bandwidth of MPI on TCP. The rise of bandwidth with respect to packet size is also faster for CLIC and MPI-CLIC.

Figure 6. Bandwidths for CLIC, MPI-CLIC, MPI (on TCP/IP), and PVM (on TCP/IP)

Figure 7.a shows the timing measurements of a 1400-byte packet flowing through the communication path defined by CLIC. As can be seen, the slowest stage corresponds to the driver processing in the receiver after accepting the interrupt generated by the NIC during the reception of the packet. As shown in Figure 8.a, the driver calls a routine that creates an SK_BUFF in system memory and copies the data from the network interface to this location. At the end of this routine, a call to CLIC_MODULE is made through the Linux bottom halves, and the data are then transferred from system to user memory. Thus, one copy is still required on the receiver side. A clear improvement in the communication latency could be obtained if the driver were able to make a direct call to CLIC_MODULE (Figure 8.b). Then, CLIC_MODULE would copy the data from the NIC to the user memory. In this case, the interrupt latency could be reduced approximately from 20 µs to 5 µs, as shown in Figure 7.b.

Figure 7. Timing measurements of a 1400-byte packet flowing through the CLIC pipeline (a); timing for the improvements shown in Figure 8 (b)

Figure 8. Transfer of a packet in the receiver using the bottom halves (a); scheme of the implemented improvement (b)

5. Conclusions

The CLIC message layer is based on efficient operating system support for communications. It substitutes the TCP and IP layers of the TCP/IP architecture, providing a reliable transport protocol with performance as close as possible to that of the hardware. MPI and PVM point-to-point communication functions can be easily mapped to the reliable point-to-point communications provided by the CLIC layer. CLIC also takes advantage of the multicast/broadcast capabilities offered by the Ethernet data-link layer, on top of which it is built.

This paper describes the changes made in the lightweight protocol CLIC to take advantage of gigabit network technology and the new features included in the network interface cards. The experimental results obtained by our implementation of CLIC on Gigabit Ethernet are also provided, and show an important improvement over TCP/IP. The results can be summarized as a minimum latency of 36 microseconds, and an asymptotic bandwidth of about 600 Mbits/s with MTU=9000 bytes and about 450 Mbits/s with MTU=1500 bytes. With MPI over CLIC, the bandwidth provided is, in the worst case, 1.5 times the bandwidth of MPI over TCP/IP. 50% of the maximum bandwidth provided by the network is obtained for a packet size of 4 Kbytes (16 Kbytes with TCP/IP). Moreover, the approach followed to develop CLIC, i.e. to optimize the communication protocols without modifying the network drivers, has been demonstrated to be correct. In this way, the new features implemented in present NICs, which are essential to take advantage of the bandwidths provided by Gigabit Ethernet, such as Jumbo frames, coalesced interrupts, and 0-copy, have been quickly included in CLIC.

Compared with GAMMA [2], CLIC provides higher latencies (36 µs in CLIC vs. 32 µs in GAMMA with the GA620 and 9.5 µs with the GII), and a slightly lower bandwidth (about 600 Mbits/s in CLIC vs. 768 Mbits/s in GAMMA with the GII and 824 Mbits/s with the GA620). Nevertheless, although GAMMA achieves better communication bandwidths and latencies, CLIC has other interesting features. It depends neither on the network interface card nor on the processor architecture; thus CLIC can be ported to any system running the Linux OS without requiring any modification of the drivers. The code is re-entrant, which allows the use of threads and the use of CLIC in systems where several processes attempt to access the OS kernel. This is very interesting for clusters of multiprocessors.
To provide an efficient implementation of CLIC in this environment, the critical sections have been reduced. CLIC provides primitives to send messages with confirmation of reception. It also has primitives for synchronous and asynchronous communication.

CLIC allows communication between processes running on the same processor. In other communication layers proposed in the literature, it is not possible to send messages between processes on the same processor. CLIC also allows the use of several network cards to increase the communication bandwidth when a switch is used to build the network (channel bonding). An efficient LAM-MPI implementation on top of CLIC has also been developed [12]. The results obtained show an improvement in the communication performance provided by CLIC with respect to the implementation of LAM-MPI using the TCP/IP protocols.

Acknowledgements

This paper has been supported by the Spanish Ministerio de Ciencia y Tecnología, under grant TIC.

References

[1] Karamcheti, V.; Chien, A.A.: "Software overhead in messaging layers: where does time go?". Proc. of ASPLOS-VI, San Jose (California), October 1994.
[2] Ciaccio, G.: "Messaging on Gigabit Ethernet: Some experiments with GAMMA and other systems". IPDPS (International Parallel and Distributed Processing Symp.) 2001, San Francisco, CA, April 2001.
[3] Flores, A.; García, J.M.: "Assessing the performance of the communication layer in a cluster of workstations". Jornadas de Paralelismo.
[4] Sterling, T.; et al.: "Beowulf: A parallel workstation for scientific computation". Proc. 24th Int. Conf. on Parallel Processing, August 1995.
[5] Welsh, M.; Basu, A.; von Eicken, T.: "Low-latency communication over Fast Ethernet". Proc. Euro-Par'96, August 1996.
[6] Chiola, G.; Ciaccio, G.: "Efficient parallel processing on low-cost clusters with GAMMA active ports". Parallel Computing, 26, 2000.
[7] Pakin, S.; Karamcheti, V.; Chien, A.: "Fast Messages (FM): Efficient, Portable Communication for Workstation Clusters and Massively-Parallel Processors". IEEE Parallel and Distributed Technology, Vol. 5, No. 2, April/June 1997.
[8] Prylli, L.; Tourancheau, B.: "BIP: a new protocol designed for high performance networking on Myrinet". Workshop PC-NOW, IPPS/SPDP'98 (Lecture Notes in Computer Science, No. 1388), April 1998.
[9] von Eicken, T.; Basu, A.; Buch, V.; Vogels, W.: "U-Net: a user-level network interface for parallel and distributed computing". Proc. of the 15th ACM Symp. on Operating Systems Principles (SOSP'95), December 1995.
[10] Bhoedjang, R.A.F.; Rühl, T.; Bal, H.E.: "User-level Network Interface Protocols". IEEE Computer, November 1998.
[11] Gilfeather, P.; Underwood, T.: "Fragmentation and High Performance IP". Workshop on Communication Architecture for Clusters, IPDPS 2001.
[12] Díaz, A.F.; Ortega, J.; Cañas, A.; Fernández, F.J.; Prieto, A.: "The Lightweight Protocol CLIC: Performance of an MPI implementation on CLIC". IEEE International Conference on Cluster Computing (CLUSTER 2001), October 2001.
[13] Díaz, A.F.; Ortega, J.; Anguita, M.; Cañas, A.; Prieto, A.: "An efficient OS support for communication on Linux clusters". Workshop on Scheduling and Resource Management for Cluster Computing, ICPP 2001 (International Conference on Parallel Processing), September 2001.
[14] Chiola, G.; Ciaccio, G.: "GAMMA: a low-cost network of workstations based on Active Messages". Euromicro, pp. 78-83, 1997.
[15] Chiola, G.; Ciaccio, G.: "Porting MPICH ADI on GAMMA with Flow Control". Midwest Workshop on Parallel Processing, Ohio, August 11-13, 1999.
[16] von Eicken, T.; Vogels, W.: "Evolution of the Virtual Interface Architecture". IEEE Computer, November 1998.


More information

Transactions on Information and Communications Technologies vol 9, 1995 WIT Press, ISSN

Transactions on Information and Communications Technologies vol 9, 1995 WIT Press,  ISSN Transactions on Information and Communications Technologies vol 9, 995 WIT Press, www.witpress.com, ISSN 743-357 Communications over ATM: communication software optimizing the L. Delgrossi*, G. Lo Re*

More information

4. Hardware Platform: Real-Time Requirements

4. Hardware Platform: Real-Time Requirements 4. Hardware Platform: Real-Time Requirements Contents: 4.1 Evolution of Microprocessor Architecture 4.2 Performance-Increasing Concepts 4.3 Influences on System Architecture 4.4 A Real-Time Hardware Architecture

More information

Operating Systems. 17. Sockets. Paul Krzyzanowski. Rutgers University. Spring /6/ Paul Krzyzanowski

Operating Systems. 17. Sockets. Paul Krzyzanowski. Rutgers University. Spring /6/ Paul Krzyzanowski Operating Systems 17. Sockets Paul Krzyzanowski Rutgers University Spring 2015 1 Sockets Dominant API for transport layer connectivity Created at UC Berkeley for 4.2BSD Unix (1983) Design goals Communication

More information

CERN openlab Summer 2006: Networking Overview

CERN openlab Summer 2006: Networking Overview CERN openlab Summer 2006: Networking Overview Martin Swany, Ph.D. Assistant Professor, Computer and Information Sciences, U. Delaware, USA Visiting Helsinki Institute of Physics (HIP) at CERN swany@cis.udel.edu,

More information

Introduction to Operating. Chapter Chapter

Introduction to Operating. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

Traffic Characteristics of Bulk Data Transfer using TCP/IP over Gigabit Ethernet

Traffic Characteristics of Bulk Data Transfer using TCP/IP over Gigabit Ethernet Traffic Characteristics of Bulk Data Transfer using TCP/IP over Gigabit Ethernet Aamir Shaikh and Kenneth J. Christensen Department of Computer Science and Engineering University of South Florida Tampa,

More information

Performance of a High-Level Parallel Language on a High-Speed Network

Performance of a High-Level Parallel Language on a High-Speed Network Performance of a High-Level Parallel Language on a High-Speed Network Henri Bal Raoul Bhoedjang Rutger Hofman Ceriel Jacobs Koen Langendoen Tim Rühl Kees Verstoep Dept. of Mathematics and Computer Science

More information

Optimizing TCP Receive Performance

Optimizing TCP Receive Performance Optimizing TCP Receive Performance Aravind Menon and Willy Zwaenepoel School of Computer and Communication Sciences EPFL Abstract The performance of receive side TCP processing has traditionally been dominated

More information

The influence of system calls and interrupts on the performances of a PC cluster using a Remote DMA communication primitive

The influence of system calls and interrupts on the performances of a PC cluster using a Remote DMA communication primitive The influence of system calls and interrupts on the performances of a PC cluster using a Remote DMA communication primitive Olivier Glück Jean-Luc Lamotte Alain Greiner Univ. Paris 6, France http://mpc.lip6.fr

More information

LAPI on HPS Evaluating Federation

LAPI on HPS Evaluating Federation LAPI on HPS Evaluating Federation Adrian Jackson August 23, 2004 Abstract LAPI is an IBM-specific communication library that performs single-sided operation. This library was well profiled on Phase 1 of

More information

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks Ryan G. Lane Daniels Scott Xin Yuan Department of Computer Science Florida State University Tallahassee, FL 32306 {ryanlane,sdaniels,xyuan}@cs.fsu.edu

More information

Using Switched Ethernet for Hard Real-Time Communication

Using Switched Ethernet for Hard Real-Time Communication International Conference on Parallel Computing in Electrical Engineering (PARELEC 2004), Sept. 2004, Dresden, Germany 1 Using Switched Ethernet for Hard Real-Time Communication Jork Loeser TU Dresden,

More information

Introduction to Operating Systems. Chapter Chapter

Introduction to Operating Systems. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

Recently, symmetric multiprocessor systems have become

Recently, symmetric multiprocessor systems have become Global Broadcast Argy Krikelis Aspex Microsystems Ltd. Brunel University Uxbridge, Middlesex, UK argy.krikelis@aspex.co.uk COMPaS: a PC-based SMP cluster Mitsuhisa Sato, Real World Computing Partnership,

More information

Measurement-based Analysis of TCP/IP Processing Requirements

Measurement-based Analysis of TCP/IP Processing Requirements Measurement-based Analysis of TCP/IP Processing Requirements Srihari Makineni Ravi Iyer Communications Technology Lab Intel Corporation {srihari.makineni, ravishankar.iyer}@intel.com Abstract With the

More information

Chapter 7: Main Memory. Operating System Concepts Essentials 8 th Edition

Chapter 7: Main Memory. Operating System Concepts Essentials 8 th Edition Chapter 7: Main Memory Operating System Concepts Essentials 8 th Edition Silberschatz, Galvin and Gagne 2011 Chapter 7: Memory Management Background Swapping Contiguous Memory Allocation Paging Structure

More information

VIA2SISCI A new library that provides the VIA semantics for SCI connected clusters

VIA2SISCI A new library that provides the VIA semantics for SCI connected clusters VIA2SISCI A new library that provides the VIA semantics for SCI connected clusters Torsten Mehlan, Wolfgang Rehm {tome,rehm}@cs.tu-chemnitz.de Chemnitz University of Technology Faculty of Computer Science

More information

Non-blocking Java Communications Support on Clusters

Non-blocking Java Communications Support on Clusters Non-blocking Java Communications Support on Clusters Guillermo L. Taboada*, Juan Touriño, Ramón Doallo UNIVERSIDADE DA CORUÑA SPAIN {taboada,juan,doallo}@udc.es 13th European PVM/MPI Users s Meeting (EuroPVM/MPI

More information

Research on the Implementation of MPI on Multicore Architectures

Research on the Implementation of MPI on Multicore Architectures Research on the Implementation of MPI on Multicore Architectures Pengqi Cheng Department of Computer Science & Technology, Tshinghua University, Beijing, China chengpq@gmail.com Yan Gu Department of Computer

More information

Evaluating Personal High Performance Computing with PVM on Windows and LINUX Environments

Evaluating Personal High Performance Computing with PVM on Windows and LINUX Environments Evaluating Personal High Performance Computing with PVM on Windows and LINUX Environments Paulo S. Souza * Luciano J. Senger ** Marcos J. Santana ** Regina C. Santana ** e-mails: {pssouza, ljsenger, mjs,

More information

Initial Performance Evaluation of the Cray SeaStar Interconnect

Initial Performance Evaluation of the Cray SeaStar Interconnect Initial Performance Evaluation of the Cray SeaStar Interconnect Ron Brightwell Kevin Pedretti Keith Underwood Sandia National Laboratories Scalable Computing Systems Department 13 th IEEE Symposium on

More information

Network Design Considerations for Grid Computing

Network Design Considerations for Grid Computing Network Design Considerations for Grid Computing Engineering Systems How Bandwidth, Latency, and Packet Size Impact Grid Job Performance by Erik Burrows, Engineering Systems Analyst, Principal, Broadcom

More information

Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet

Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet Pilar González-Férez and Angelos Bilas 31 th International Conference on Massive Storage Systems

More information

Performance Evaluation of InfiniBand with PCI Express

Performance Evaluation of InfiniBand with PCI Express Performance Evaluation of InfiniBand with PCI Express Jiuxing Liu Server Technology Group IBM T. J. Watson Research Center Yorktown Heights, NY 1598 jl@us.ibm.com Amith Mamidala, Abhinav Vishnu, and Dhabaleswar

More information

Chapter 8: Memory-Management Strategies

Chapter 8: Memory-Management Strategies Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and

More information

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

IX: A Protected Dataplane Operating System for High Throughput and Low Latency IX: A Protected Dataplane Operating System for High Throughput and Low Latency Belay, A. et al. Proc. of the 11th USENIX Symp. on OSDI, pp. 49-65, 2014. Reviewed by Chun-Yu and Xinghao Li Summary In this

More information