The Lightweight Protocol CLIC on Gigabit Ethernet


Díaz, A.F.; Ortega, J.; Cañas, A.; Fernández, F.J.; Anguita, M.; Prieto, A.
Departamento de Arquitectura y Tecnología de Computadores
University of Granada (Spain)
{afdiaz,julio,acanas,jfernand,manguita}@atc.ugr.es, aprieto@ugr.es

Abstract

In Gigabit class networks, the physical transmission time is small compared to the time required to process the TCP/IP protocol stack. Thus, the usefulness of lightweight protocols that reduce the communication software overhead is even higher, as the performance demands shift to the network interface hardware and/or software. The CLIC communication protocol has recently been proposed for efficient communication in clusters using the Linux operating system. Its approach to optimizing communication performance in a cluster of computers differs from that of providing a user-level interface that removes the operating system (OS) mediation from the communication path. Thus, besides reducing latencies and increasing bandwidth figures even for short messages, CLIC meets other requirements such as multiprogramming, portability, protection against corrupted programs, reliable message delivery, direct access to the network for all applications, etc., by providing optimized OS support for reliable and efficient network software that avoids the TCP/IP protocol stack. In this paper we show how new facilities and the improved performance of network interface cards (NICs) have been incorporated into CLIC in order to take advantage of gigabit network technology.

1. Introduction

The use of clusters of personal computers or workstations as high-performance parallel platforms poses problems related to communication between processors, which may limit their use to coarse-grain applications. Although network technology is moving towards gigabits and tens of gigabits per second, it is difficult for applications to take advantage of this improvement. The gap between the user communication requirements of in-order and reliable message delivery and deadlock safety, and network features such as arbitrary delivery order, limited fault-handling, and finite buffering capabilities, makes it necessary to have layers of protocols that provide the communication services not supported by the network hardware but required by the applications [1]. In this way, protocols such as TCP/IP cause an overhead that represents an important part of the communication cost. This inefficiency is even worse in gigabit networks, such as Myrinet and Gigabit Ethernet, where the physical transmission time is negligible compared to the time spent processing the communication protocols [2]: while processors reach gigahertz speeds and networks provide gigabit bandwidths, the I/O buses (usually PCI buses) have become the bottleneck in the communication paths.

The main approaches adopted to reduce this software overhead have been the improvement of the TCP/IP layers [3,4] and the substitution of the TCP/IP layers by alternative ones [4-6]. With respect to this latter approach, two alternatives can also be considered: communication layers with efficient OS support [6,12], and user-level communication interfaces [7-10]. This last alternative attempts to remove the OS mediation from the communication path to avoid system calls for communication, and to provide a closer interaction between the parallel application and the NIC.
Recently, several companies, such as Intel, IBM, Microsoft, Tandem and Compaq, defined the Virtual Interface Architecture (VIA) [16], a set of user-level interface specifications with the goal of being recognized as a de facto standard for communication in clusters. VIA incorporates influences from previously proposed user-level interfaces, such as U-Net, AM-II, and VMMC [16]. Instead, other communication protocols such as GAMMA [2,6] and CLIC [12,13] substitute the TCP/IP layers and improve the communication performance with efficient OS support for communication.

Section 2 describes the new situation created by gigabit network technology and some improvements in the network interface cards (NICs) to take advantage of this gigabit technology. Section 3 describes the main characteristics of the proposed CLIC protocol, along with the modifications that have been included to scale its performance to gigabit networks, more specifically to Gigabit Ethernet. In this section, we also compare some characteristics of CLIC with those of VIA and GAMMA. Then, Section 4 gives details about the experimental results corresponding to the performance evaluation of CLIC. Finally, Section 5 states the conclusions of the paper.

2. Gigabit networks and network interfaces

When communication is done through a gigabit class network, the bottleneck in the communication path can shift from the network to other parts of the system, as we show in what follows.

Figure 1. Different paths to transfer data to the NIC (user buffer, kernel output buffer, main memory, north bridge, PCI bus, network interface, NIC output buffer)

Figure 1 describes four possible paths to transfer the data to be sent from the user memory to the network interface buffer. From this buffer, the data are sent to the receiver through the network. In path 1, data are copied directly from user memory to the network interface. In path 2, data are first copied into the output buffer of the network card and then copied to the network interface by the NIC's own processor; thus two copies are required in this case. Path 3 also requires two copies: the first copy, made by the host processor, moves the data from the user buffer to a kernel buffer; the second copy transfers the data directly from the kernel buffer to the network interface. Finally, path 4 is similar to path 3, although one more copy is done, as the data are first copied to the network output buffer and then transferred to the network interface.

The first version of CLIC [13], implemented on Fast Ethernet, uses path 4, as one of the design requirements we considered was that the drivers of the NICs could not be modified. This is an important difference between CLIC and other lightweight protocols such as GAMMA, and it makes it possible to provide features related to portability, protection, multiprogramming, reliable message delivery, etc. Nevertheless, as NICs (and their corresponding drivers) incorporate new features, it has been possible to improve CLIC by implementing 0-copy and other characteristics described in Section 3. Thus, the version of CLIC for Gigabit Ethernet transfers the data through path 2.

The extension of CLIC to gigabit class networks poses new situations and interesting problems that require new strategies to optimize the communication protocol. Thus, although the data copies are an important element of the communication overhead, there are other factors, such as interrupt processing, that constitute important bottlenecks in the communication path. These factors reduce the time the processor has to process the applications. For example, if we use the standard MTU (Maximum Transmission Unit) in Ethernet, i.e. about 1500 bytes, a Gigabit Ethernet NIC will produce approximately one interrupt every 12 microseconds (1500 bytes × 8 bits/byte × 1 ns/bit). Although the time to process an interrupt changes according to the system characteristics, if we consider that the PCI 2.1 specification (33 MHz) allows for delays on the order of microseconds, even with a very optimized use of the OS it would be very difficult to cope with such an interrupt rate. Moreover, not only does the interrupt rate increase, but so does the number of TCP/IP headers to process through the protocol stack.
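As a quick sanity check of these figures, the following standalone C program (illustrative, not from the paper) reproduces the per-frame arithmetic for the standard MTU and for the 9000-byte Jumbo frames discussed in the next section:

```c
/* Back-of-the-envelope check of the interrupt rates quoted in the text:
 * at 1 Gb/s, a frame of MTU bytes arrives roughly every MTU*8 ns.
 * All figures are illustrative, not measurements from the paper. */
#include <stdio.h>

int main(void)
{
    const double link_gbps = 1.0;       /* Gigabit Ethernet */
    const int mtus[] = { 1500, 9000 };  /* standard and Jumbo frames */

    for (int i = 0; i < 2; i++) {
        /* time per frame in microseconds: bits / (bits per microsecond) */
        double us = mtus[i] * 8.0 / (link_gbps * 1000.0);
        printf("MTU %5d bytes -> one frame (potentially one interrupt) "
               "every %.1f us, i.e. ~%.0f interrupts/s\n",
               mtus[i], us, 1e6 / us);
    }
    return 0;
}
```

For the 1500-byte MTU this yields the 12 µs interval quoted above; for 9000 bytes the interval grows to 72 µs, the factor-of-six increase mentioned below.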
In Fast Ethernet, with bandwidths of 100 Mb/s, it is possible to reach 90% of the maximum bandwidth with a 15-20% CPU utilization. Achieving a similar situation in networks with 1 Gb/s bandwidths would require almost 100% of the processor power [11]. Taking into account the usual memory bus bandwidths, it would seem that the influence of the data copies on the overall communication bandwidth is not significant. Nevertheless, a copy uses system resources such as the memory and PCI buses and the processor, thus affecting the global performance of the system and the applications. So it is important to decrease the number of data copies, and many NICs for Gigabit Ethernet incorporate features allowing a direct data transfer between user memory and the NIC buffers. By using the corresponding call to the driver, the message layer can take advantage of these new features.

Besides the reduction of copies in the communication critical path, other alternatives that can improve the performance of the network interface are Jumbo frames, coalesced interrupts, and fragmentation at source and destination [11]. Jumbo frames allow the use of MTUs longer than the Ethernet standard of 1500 bytes. Thus, MTUs of up to 9000 bytes can be used to reduce the number of generated interrupts and the overhead associated with communication protocol processing. Most NIC and switch manufacturers provide Jumbo frames. However, these frames affect interoperability (both communicating computers have to use Jumbo frames), and they are not scalable, as they increase the time between frame arrivals only by a factor of six.

With coalesced interrupts, the NIC only interrupts the processor after a given time interval, or after a given number of packets has arrived. Although this technique reduces the number of generated interrupts, it delays the reception of messages. This situation is especially undesirable for small packets. Nevertheless, the drivers of present NICs usually allow the dynamic adjustment of the time intervals used in interrupt coalescing.

Another technique that reduces processor utilization and allows adequate processing of long messages without penalizing short ones is fragmentation. It consists in sending packets to the NIC with sizes larger than the link MTU. The NIC divides the packets according to the MTU size before sending them, and it also assembles received packets to build the packet that has to be delivered to the application. To implement this technique, the NIC should include the corresponding features, and it is necessary to make some slight modifications in the driver and to program the NIC firmware appropriately. The use of fragmentation with the Alteon Acenic 2, which includes two 88 MHz MIPS R4000-like processors and 2 Mbytes of DRAM, is described in [11].

In the present version of CLIC for Gigabit Ethernet, coalesced interrupts, 0-copy, and Jumbo frames have been implemented. As the use of fragmentation would require modifying the driver, it has not been included, in order to preserve the portability of the system. Nevertheless, if portability is not an important issue for a given system, it would be interesting to use fragmentation to improve performance. The implementation of fragmentation as a selectable alternative will be considered in future versions of CLIC. As we will see in the next section, CLIC implements an optimized communication processing by the OS without modifying the drivers. Thus, it has been relatively easy to take advantage of the present trend towards more powerful network and I/O cards. The hardware included in such cards provides more functionality, reducing CPU utilization in the communication path.

3. The CLIC protocol

A detailed description of the first version of CLIC for Fast Ethernet is provided in [12,13], where a complete reference to other previously proposed works on the topic, along with their differences with respect to CLIC, is also given. This section provides a brief description of the main characteristics of a new version of CLIC for Gigabit Ethernet.

Figure 2. Comparison of CLIC and TCP/IP layers (user processes access either the CLIC module or the sockets/TCP/IP stack, both sitting on the Gigabit Ethernet drivers)

3.1 How CLIC works

CLIC is embedded in the Linux kernel and provides an interface to the user applications (Figure 2). The key to the communication improvement provided by CLIC is the reduction in the number of protocol layers, which decreases the software overhead and the number of data copies. For example, the IP layer is not necessary in a cluster of computers where the machines are connected by the same network, making it unnecessary to use IP protocols with routing. In this way, CLIC consists of a reliable transport protocol that interfaces with the NIC and its corresponding driver. When an application executes a send, a system call is generated. For example, if the CPU is a processor with Intel architecture, this system call is implemented with the interrupt INT 80h (Figure 3 shows this case). This system call is labelled in Figure 3 as (1).
The overhead associated with entering and leaving the OS kernel through the system call is approximately 0.65 µs (on a PC running at 1.5 GHz). The generated system call activates the program CLIC_MODULE, which is inserted within the OS kernel. CLIC_MODULE composes the headers, updates the SK_BUFF structure, and calls the driver ((2) in Figure 3). With respect to the headers, among the three existing levels of headers in Ethernet (level 1: pure Ethernet; level 2: LLC; and level 3: IP), the level 1 header is used. This header consists of 14 bytes (6 to indicate the destination, 6 to indicate the origin of the message, and 2 to indicate the type of packet). Then, the CLIC header is added. This header has 12 bytes that indicate whether the packet is an MPI packet, an internal packet, a kernel function packet, etc.
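The header layout just described can be captured in a short C sketch. The Ethernet fields are standard; the CLIC header field names, and the way they split the 12 bytes, are hypothetical, chosen only to match the sizes and purposes stated in the text:

```c
/* Sketch of the packet header layout described above. The Ethernet
 * (level 1) fields are standard; the clic_hdr fields are assumptions
 * matching the stated size (12 bytes carrying the packet type and
 * bookkeeping), not the actual structure used by the implementation. */
#include <stdint.h>
#include <stdio.h>

struct eth_hdr {                 /* level 1: pure Ethernet, 14 bytes */
    uint8_t  dst[6];             /* destination MAC address */
    uint8_t  src[6];             /* origin MAC address */
    uint16_t ethertype;          /* type of packet */
} __attribute__((packed));

struct clic_hdr {                /* 12 bytes; field names are assumptions */
    uint16_t pkt_type;           /* MPI packet, internal, kernel function... */
    uint16_t seq;                /* sequence number for reliable delivery */
    uint32_t len;                /* payload length */
    uint32_t src_process;        /* sender identification */
} __attribute__((packed));

int main(void)
{
    printf("Ethernet header: %zu bytes, CLIC header: %zu bytes\n",
           sizeof(struct eth_hdr), sizeof(struct clic_hdr));
    return 0;
}
```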

Figure 3. Schematic of CLIC inner working (sender and receiver: user application and user memory, CLIC module, driver, system memory, kernel)

The SK_BUFF structure used by the drivers allows a fragmented send, i.e. it is possible to send data that are not allocated in contiguous memory addresses. Thus, SK_BUFF includes the pointers to the headers and to the data to be sent from the user space. The driver initializes the DMA transfer of the data (which will be moved by the NIC acting as a bus master) and finishes by indicating to CLIC_MODULE whether or not it is possible to send the data. If the data can be sent, the NIC moves them from memory to the NIC buffers by using the pointers stored in the SK_BUFF structure ((3) in Figure 3). In this case, CLIC_MODULE and the driver can finish before the data transfer starts, and free the CPU. Thus, the sender overhead is the time to execute CLIC_MODULE and the driver, and it (almost) does not depend on the size of the message. If the data cannot be sent at the present moment, CLIC_MODULE copies the data into system memory. This copy is part of the sender overhead for this message and the CPU consumes time to do it; nevertheless, this time is overlapped with the communication of other messages by the NIC. Later, when the data can be sent, they will be moved from system memory to the NIC by a DMA transfer initialized by the driver, in which the NIC acts as a bus master. Finally, the NIC inserts these packets into the communication network.

In our previous version of CLIC (for Fast Ethernet), a copy is required to move the data to be sent from the user memory to a zone in the system memory where an SK_BUFF structure is implemented. After that, a header is also added to each packet. Then, CLIC_MODULE searches the corresponding OS table for the pointer to the driver and calls it. Then, the driver copies the data into the NIC buffers.

Figure 3 also illustrates message reception with an immediate write of the message into the user memory of the receiver process (remote write). Nevertheless, the following description also deals with the reception of a message when the receiver calls a receive function. When a packet arrives, the NIC of the receiver generates an interrupt (in the case of a PCI bus, the interrupt assigned by this bus), indicated in Figure 3 as (4). This interrupt starts the execution of the driver. The driver routine remains active until all the data stored in the NIC buffers have been moved to system memory. When the packet has been transferred to system memory ((5) in Figure 3), and after checking the type information codified in the last two bytes of the packet header, the bottom halves are checked and, since the corresponding request from the NIC is pending, CLIC_MODULE is called (step (6) in Figure 3) to execute the function corresponding to the type of packet received. First, CLIC_MODULE checks whether there is a process waiting for the corresponding packet. If so, CLIC_MODULE moves the data to the user memory of that process. Otherwise, the packet remains in system memory. When a process calls a receive function for this packet, a lightweight system call is generated (INT 80h in the case of Pentium, as in the send process), indicating the location in user memory to which the data have to be transferred.
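Returning briefly to the sender side, the following userspace sketch models the scatter-gather role that SK_BUFF plays: a descriptor carries pointers to the headers and to the (possibly non-contiguous) user data, and the "NIC" gathers them in order in a single pass. All names are illustrative; the real kernel structures differ:

```c
/* Userspace model of the fragmented (scatter-gather) send that SK_BUFF
 * enables: the descriptor points at headers and user data, so the NIC
 * can gather non-contiguous memory in one DMA transfer. Illustrative only. */
#include <stdio.h>
#include <string.h>

#define MAX_FRAGS 4

struct frag { const void *addr; size_t len; };

struct send_desc {                 /* stands in for the SK_BUFF role */
    struct frag frags[MAX_FRAGS];
    int nr_frags;
};

/* Model of the NIC acting as bus master: gather all fragments in order. */
static size_t nic_dma_gather(const struct send_desc *d, char *wire, size_t cap)
{
    size_t off = 0;
    for (int i = 0; i < d->nr_frags && off + d->frags[i].len <= cap; i++) {
        memcpy(wire + off, d->frags[i].addr, d->frags[i].len);
        off += d->frags[i].len;
    }
    return off;
}

int main(void)
{
    const char eth_hdr[14]  = "ETH-HDR......";  /* 14-byte level 1 header */
    const char clic_hdr[12] = "CLIC-HDR...";    /* 12-byte CLIC header */
    const char user_data[]  = "payload from user memory";
    struct send_desc d = {
        .frags = { { eth_hdr,  sizeof eth_hdr },
                   { clic_hdr, sizeof clic_hdr },
                   { user_data, sizeof user_data } },
        .nr_frags = 3,
    };
    char wire[128];
    printf("gathered %zu bytes onto the wire\n",
           nic_dma_gather(&d, wire, sizeof wire));
    return 0;
}
```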
To receive an asynchronous message (a remote write), CLIC_MODULE directly moves the packet from system memory to the corresponding user memory location without having to wait for any receive call (step (7) in Figure 3). CLIC_MODULE is called from the user process executing either a receive or a send. If the message has not arrived yet, CLIC_MODULE does nothing and returns. Then, control passes to the receiver, which can proceed with the execution of other instructions (if the receive is non-blocking), or remains waiting for the corresponding message. In this case, the OS scheduler proceeds as necessary. With respect to the OS mediation in the communication path through system calls, although this produces an overhead, its magnitude is small (less than one microsecond), and it makes it possible to use all the services provided by the OS. For example, an efficient scheduler remains directly applicable when CLIC is used in realistic (multi-user, multitasking) conditions.
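The receive-side logic just described (deliver immediately if a receiver is already waiting, otherwise park the packet in system memory until a receive is posted) can be summarized in a toy model; the names and the fixed-size queue are assumptions for illustration only:

```c
/* Toy model of receive-side matching: an arriving packet is delivered
 * straight to user memory if a receiver is waiting, otherwise it is
 * parked in "system memory" until a receive is posted. Illustrative. */
#include <stdio.h>
#include <stdbool.h>

#define QLEN 8

static int pending[QLEN];     /* packets parked in system memory */
static int npending;
static bool receiver_waiting;

static void packet_arrived(int pkt)      /* called from the "bottom half" */
{
    if (receiver_waiting) {
        printf("packet %d copied straight to user memory\n", pkt);
        receiver_waiting = false;
    } else if (npending < QLEN) {
        pending[npending++] = pkt;       /* keep it until a receive arrives */
    }
}

static void user_receive(void)           /* the lightweight system call */
{
    if (npending > 0)
        printf("packet %d moved from system to user memory\n",
               pending[--npending]);
    else
        receiver_waiting = true;         /* block (or return, if non-blocking) */
}

int main(void)
{
    packet_arrived(1);   /* no receiver yet: parked in system memory */
    user_receive();      /* consumes packet 1 */
    user_receive();      /* nothing pending: waits */
    packet_arrived(2);   /* delivered directly to the waiting receiver */
    return 0;
}
```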

3.2 Comparison with VIA

Here we give some details of the VIA and GAMMA interfaces, to put CLIC in the context of the research done in this field. VIA [16] introduces the concept of virtual interface (VI). Each process opens a VI to communicate with another process. There are two queues associated with each VI (one for receiving messages and one for sending messages), with message descriptors organized as linked lists. Each descriptor points to one or more buffer descriptors. To send a message, an application adds a new message descriptor at the tail of the send queue. After a message has been sent, an end-of-transmission bit is set in the descriptor of the transmitted message, and the application extracts the descriptor from the queue when it arrives at the head. To receive a message, the application adds, at the end of the receive queue, a descriptor of free buffers where the messages can be written as they arrive. VIA also allows direct transfers between local and remote memories (RDMA, remote DMA). In VIA, the reduction in the communication overhead is based on (1) avoiding the OS participation in multiplexing the communication hardware between the processes; (2) eliminating the copies of the messages between different memory zones; and (3) not using interrupts. As we explain below, CLIC uses different strategies for points (1) and (3):

(a) The OS interaction with the network interface (NI) hardware. Processes in a computer share the hardware, and more specifically the communication hardware. When protocols such as TCP/IP are used, the control of access to the NI hardware is done by software running in kernel mode. Direct access to the interface hardware from applications running in user mode is not allowed; they require a system call to run the corresponding OS routine. The processes also share the memory of the computer. In this case, however, the virtual memory system uses the corresponding page and segment tables, specific hardware to make the required address translations, etc., and avoids software intervention in each memory access (with the corresponding overhead). Applying a similar strategy, VIA defines multiple virtual interfaces that can be used directly by the applications (in user mode) and changes the way communication is considered in the system: instead of being treated as an infrequent transaction between slow devices, it has a status similar to memory accesses. However, VIA does not guarantee reliable communication. Instead, the application (not the communication system) has to take care of reliability. Thus, the situation is similar to that of UDP/IP, although reliable communication software for VIA is more elaborate, since copying data between different memory zones is not allowed [10].

CLIC relies on the OS to access the network interface hardware. It would be possible to decrease the overhead associated with switching between user and system modes by using lightweight calls, as in other communication systems such as GAMMA [2,6,14,15]. In this case, when returning to user mode, the scheduler is not called, and it is possible to save some amount of time. CLIC does not use this type of call because, when there are several requests for pending messages, the intervention of the scheduler makes it possible to attend to these messages faster. Moreover, the amount of time required to switch from user to kernel mode (or vice versa) is about 0.65 µs (less than 2% of the time required to send a message).
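As an aside, the VI send-queue mechanics described at the beginning of this subsection can be modelled compactly; the structure and function names below are illustrative, not taken from the VIA specification:

```c
/* Compact model of a VIA-style send queue: the application appends a
 * descriptor at the tail, the "hardware" marks it done, and completed
 * descriptors are reaped from the head. Names are illustrative. */
#include <stdio.h>
#include <stdbool.h>

struct desc {
    const void *buf;
    size_t len;
    bool done;              /* end-of-transmission bit set on completion */
    struct desc *next;
};

struct vi_queue { struct desc *head, *tail; };

static void post_send(struct vi_queue *q, struct desc *d)
{
    d->done = false;
    d->next = NULL;
    if (q->tail) q->tail->next = d; else q->head = d;
    q->tail = d;
}

static struct desc *reap(struct vi_queue *q)   /* application polls the head */
{
    struct desc *d = q->head;
    if (!d || !d->done) return NULL;           /* head not yet completed */
    q->head = d->next;
    if (!q->head) q->tail = NULL;
    return d;
}

int main(void)
{
    struct vi_queue sq = { 0 };
    struct desc d = { .buf = "hello", .len = 5 };
    post_send(&sq, &d);
    d.done = true;          /* the NIC would set this on completion */
    printf("reaped descriptor of %zu bytes\n", reap(&sq)->len);
    return 0;
}
```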
The improvement introduced by CLIC with respect to the role of the OS in the communication process comes from the reduction in the complexity associated with the TCP/IP protocol suite. As the CLIC communication protocol is simpler, it requires a smaller header, and it is possible to reduce the time required to build the packets and to increase the effective bandwidth.

(b) Interrupts. VIA uses polling to determine whether a message has arrived. Thus, the processor consumes cycles while it waits for messages to be received. If most messages are short, this wait time is not very high and, if they correspond to coordination messages, the processor cannot execute other instructions of the process anyway (if there are more processes waiting for the processor, control could pass to one of them). However, if polling is implemented through accesses to the network interface card using I/O transactions, the time spent can be too high, thus decreasing the effective bandwidth and increasing the delay in the transmission of packets between the network interface card and main memory. So, when polling is used, the polling frequency must be carefully selected in order to keep the corresponding overhead as low as possible. In CLIC, interrupts are used, since we consider that they represent the right way to manage asynchronous events. The interrupt latency represents a significant part of the message latency (about 20 µs). However, it is frequently not necessary to attend to one interrupt per packet, because when the routine that transfers the packets is executed, it moves all the pending packets. Moreover, to reduce the time the CPU has to spend processing communication tasks, CLIC uses coalesced interrupts.
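A minimal sketch of a coalesced-interrupt policy such as the one CLIC relies on, assuming illustrative thresholds (raise one interrupt after either a packet count or a maximum wait is exceeded):

```c
/* Sketch of a coalesced-interrupt policy: raise one interrupt after
 * either max_frames packets have accumulated or max_wait_us has elapsed
 * since the first unserviced packet. Thresholds are illustrative. */
#include <stdio.h>
#include <stdbool.h>

struct coalesce {
    int  max_frames, max_wait_us;
    int  pending;
    long first_arrival_us;      /* timestamp of oldest unserviced packet */
};

static bool on_packet(struct coalesce *c, long now_us)
{
    if (c->pending++ == 0)
        c->first_arrival_us = now_us;
    /* interrupt if enough packets piled up or the oldest waited too long */
    if (c->pending >= c->max_frames ||
        now_us - c->first_arrival_us >= c->max_wait_us) {
        c->pending = 0;
        return true;            /* raise the interrupt; driver drains all */
    }
    return false;
}

int main(void)
{
    struct coalesce c = { .max_frames = 4, .max_wait_us = 100 };
    long t = 0;
    for (int i = 0; i < 10; i++, t += 12)      /* a packet every 12 us */
        if (on_packet(&c, t))
            printf("interrupt at t=%ld us after batching packets\n", t);
    return 0;
}
```

With these (assumed) thresholds, a 1 Gb/s stream of 1500-byte frames triggers one interrupt per four packets instead of one per packet, at the cost of delaying the first packet of each batch, which is exactly the trade-off discussed above.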

4. Experimental results

We have evaluated the effect of 0-copy for different MTU sizes, also considering the use of Jumbo frames (MTU=9000 bytes) against the standard Ethernet MTU of 1500 bytes. The results for CLIC and TCP/IP are shown in Figure 4. The SMC9462TX and 3C996-T Gigabit Ethernet/PCI NICs have been used to obtain these results. The PCI buses of the connected computers are 33 MHz, 32-bit buses. All the experiments use the coalesced interrupts provided by the NICs. As shown in Figure 4, although Jumbo frames and 0-copy both provide better performance, the improvement achieved by Jumbo frames is higher than that obtained by 0-copy. Moreover, the effect of 0-copy is higher with 1500-byte frames than with 9000-byte ones.

Figure 4. Bandwidths with CLIC for different MTUs (9000 and 1500 bytes) and 0- or 1-copy

Figure 5 compares the performance of CLIC and TCP/IP for different values of the MTU. In all cases, 0-copy has been used. As shown, CLIC provides more than a twofold increase over the bandwidths achieved by TCP/IP in the best case for TCP/IP (MTU=9000 bytes). Moreover, CLIC allows the bandwidth versus packet size curves to rise faster than with TCP/IP. The latency for short messages (0 bytes) is 36 microseconds, and 50% of the bandwidth is reached for packets of 4 Kbytes with CLIC, and approximately 16 Kbytes with TCP/IP.

Figure 5. CLIC vs. TCP/IP for different MTUs (9000 and 1500)

Figure 6 compares the performance of CLIC, an implementation of MPI on CLIC (MPI-CLIC), and implementations of MPI and PVM on TCP/IP (denoted MPI and PVM in the figure). As shown, the bandwidths of CLIC and MPI-CLIC are higher than those provided by MPI and PVM on TCP/IP. In the worst case, for long messages, MPI on CLIC provides 1.5 times the bandwidth of MPI on TCP. The rise of bandwidth with respect to packet size is also faster for CLIC and MPI-CLIC.

Figure 6. Bandwidths for CLIC, MPI-CLIC, MPI (on TCP/IP), and PVM (on TCP/IP)

Figure 7.a shows the timing measurements of a 1400-byte packet flowing through the communication path defined by CLIC. As can be seen, the slowest stage corresponds to the driver processing in the receiver after accepting the interrupt generated by the NIC during the reception of the packet. As shown in Figure 8.a, the driver calls a routine that creates an SK_BUFF in system memory and copies the data from the network interface to this location. At the end of this routine, a call to CLIC_MODULE is made through the Linux bottom halves, and the data are then transferred from system to user memory. Thus, one copy is still required on the receiver side. A clear improvement in the communication latency could be obtained if the driver were able to make a direct call to CLIC_MODULE (Figure 8.b). Then, CLIC_MODULE would copy the data from the NIC to the user memory. In this case, the interrupt latency could be reduced approximately from 20 µs to 5 µs, as shown in Figure 7.b.

Figure 7. Timing measurements of a 1400-byte packet flowing through the CLIC pipeline (a); timing for the improvements shown in Figure 8 (b)

Figure 8. Transfer of a packet in the receiver using the bottom halves (a); scheme of the implemented improvement (b)

5. Conclusions

The CLIC message layer is based on efficient operating system support for communications. It substitutes the TCP and IP layers of the TCP/IP architecture, providing a reliable transport protocol with performance as close as possible to that of the hardware. MPI and PVM point-to-point communication functions can be easily mapped to the reliable point-to-point communications provided by the CLIC layer. CLIC also takes advantage of the multicast/broadcast capabilities offered by the Ethernet data-link layer, on top of which it is built.

This paper describes the changes made in the lightweight protocol CLIC to take advantage of gigabit network technology and the new features included in the network interface cards. The experimental results obtained by our implementation of CLIC on Gigabit Ethernet are also provided, and show an important improvement over TCP/IP. The results can be summarized as a minimum latency of 36 microseconds, and an asymptotic bandwidth of about 600 Mbits/s with MTU=9000 bytes and about 450 Mbits/s with MTU=1500 bytes. With MPI over CLIC, the bandwidth provided is, in the worst case, 1.5 times the bandwidth of MPI over TCP/IP. 50% of the maximum bandwidth provided by the network is obtained for a packet size of 4 Kbytes (16 Kbytes with TCP/IP). Moreover, the approach followed to develop CLIC, i.e. to optimize the communication protocols without modifying the network drivers, has been demonstrated to be correct. In this way, the new features implemented in present NICs, which are essential to take advantage of the bandwidths provided by Gigabit Ethernet, such as Jumbo frames, coalesced interrupts, and 0-copy, have been quickly included in CLIC.

Compared with GAMMA [2], CLIC provides higher latencies (36 µs in CLIC vs. 32 µs in GAMMA with the GA620 and 9.5 µs with the GII), and a slightly lower bandwidth (about 600 Mbits/s in CLIC vs. 768 Mbits/s in GAMMA with the GII and 824 Mbits/s with the GA620). Nevertheless, although GAMMA achieves better communication bandwidths and latencies, CLIC has other interesting features. It depends neither on the network interface card nor on the processor architecture; thus CLIC can be ported to any system running the Linux OS without requiring any modification of the drivers. The code is re-entrant, which allows the use of threads and the use of CLIC in systems where several processes attempt to access the OS kernel. This is very interesting for clusters of multiprocessors.
To provide an efficient implementation of CLIC in this environment, the critical sections have been reduced. CLIC provides primitives to send messages with confirmation of reception. It also has primitives for synchronous and asynchronous communication.

CLIC allows communication between processes running on the same processor. In other communication layers proposed in the literature, it is not possible to send messages between processes on the same processor. CLIC also allows the use of several network cards to increase the communication bandwidth when a switch is used to build the network (channel bonding). An efficient LAM-MPI implementation on top of CLIC has also been developed [12]. The results obtained show an improvement in the communication performance provided by CLIC with respect to the implementation of LAM-MPI using the TCP/IP protocols.

Acknowledgements

This paper has been supported by the Spanish Ministerio de Ciencia y Tecnología, under grant TIC.

References

[1] Karamcheti, V.; Chien, A.A.: "Software overhead in messaging layers: where does time go?". Proc. of ASPLOS-VI, San Jose (California), October 1994.
[2] Ciaccio, G.: "Messaging on Gigabit Ethernet: Some experiments with GAMMA and other systems". IPDPS (International Parallel and Distributed Processing Symp.) 2001, San Francisco, CA, April 2001.
[3] Flores, A.; García, J.M.: "Assessing the performance of the communication layer in a cluster of workstations". Jornadas de Paralelismo.
[4] Sterling, T.; et al.: "Beowulf: A parallel workstation for scientific computation". Proc. 24th Int. Conf. on Parallel Processing, August 1995.
[5] Welsh, M.; Basu, A.; von Eicken, T.: "Low-latency communication over Fast Ethernet". Proc. Euro-Par'96, August 1996.
[6] Chiola, G.; Ciaccio, G.: "Efficient parallel processing on low-cost clusters with GAMMA active ports". Parallel Computing, 26, 2000.
[7] Pakin, S.; Karamcheti, V.; Chien, A.: "Fast Messages (FM): Efficient, Portable Communication for Workstation Clusters and Massively-Parallel Processors". IEEE Parallel and Distributed Technology, Vol. 5, No. 2, April/June 1997.
[8] Prylli, L.; Tourancheau, B.: "BIP: a new protocol designed for high performance networking on Myrinet". Workshop PC-NOW, IPPS/SPDP'98 (Lecture Notes in Computer Science, No. 1388), April 1998.
[9] von Eicken, T.; Basu, A.; Buch, V.; Vogels, W.: "U-Net: a user-level network interface for parallel and distributed computing". Proc. of the 15th ACM Symp. on Operating Systems Principles (SOSP'95), December 1995.
[10] Bhoedjang, R.A.F.; Rühl, T.; Bal, H.E.: "User-level Network Interface Protocols". IEEE Computer, November 1998.
[11] Gilfeather, P.; Underwood, T.: "Fragmentation and High Performance IP". Workshop on Communication Architecture for Clusters, IPDPS 2001.
[12] Díaz, A.F.; Ortega, J.; Cañas, A.; Fernández, F.J.; Prieto, A.: "The Lightweight Protocol CLIC: Performance of an MPI implementation on CLIC". IEEE International Conference on Cluster Computing (CLUSTER 2001), October 2001.
[13] Díaz, A.F.; Ortega, J.; Anguita, M.; Cañas, A.; Prieto, A.: "An efficient OS support for communication on Linux clusters". Workshop on Scheduling and Resource Management for Cluster Computing, ICPP 2001 (International Conference on Parallel Processing), September 2001.
[14] Chiola, G.; Ciaccio, G.: "GAMMA: a low-cost network of workstations based on Active Messages". Euromicro, pp. 78-83, 1997.
[15] Chiola, G.; Ciaccio, G.: "Porting MPICH ADI on GAMMA with Flow Control". Midwest Workshop on Parallel Processing, Ohio, August 11-13, 1999.
[16] von Eicken, T.; Vogels, W.: "Evolution of the Virtual Interface Architecture". IEEE Computer, November 1998.


More information

Transactions on Information and Communications Technologies vol 9, 1995 WIT Press, ISSN

Transactions on Information and Communications Technologies vol 9, 1995 WIT Press,  ISSN Transactions on Information and Communications Technologies vol 9, 995 WIT Press, www.witpress.com, ISSN 743-357 Communications over ATM: communication software optimizing the L. Delgrossi*, G. Lo Re*

More information

4. Hardware Platform: Real-Time Requirements

4. Hardware Platform: Real-Time Requirements 4. Hardware Platform: Real-Time Requirements Contents: 4.1 Evolution of Microprocessor Architecture 4.2 Performance-Increasing Concepts 4.3 Influences on System Architecture 4.4 A Real-Time Hardware Architecture

More information

Operating Systems. 17. Sockets. Paul Krzyzanowski. Rutgers University. Spring /6/ Paul Krzyzanowski

Operating Systems. 17. Sockets. Paul Krzyzanowski. Rutgers University. Spring /6/ Paul Krzyzanowski Operating Systems 17. Sockets Paul Krzyzanowski Rutgers University Spring 2015 1 Sockets Dominant API for transport layer connectivity Created at UC Berkeley for 4.2BSD Unix (1983) Design goals Communication

More information

CERN openlab Summer 2006: Networking Overview

CERN openlab Summer 2006: Networking Overview CERN openlab Summer 2006: Networking Overview Martin Swany, Ph.D. Assistant Professor, Computer and Information Sciences, U. Delaware, USA Visiting Helsinki Institute of Physics (HIP) at CERN swany@cis.udel.edu,

More information

Introduction to Operating. Chapter Chapter

Introduction to Operating. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

Traffic Characteristics of Bulk Data Transfer using TCP/IP over Gigabit Ethernet

Traffic Characteristics of Bulk Data Transfer using TCP/IP over Gigabit Ethernet Traffic Characteristics of Bulk Data Transfer using TCP/IP over Gigabit Ethernet Aamir Shaikh and Kenneth J. Christensen Department of Computer Science and Engineering University of South Florida Tampa,

More information

Performance of a High-Level Parallel Language on a High-Speed Network

Performance of a High-Level Parallel Language on a High-Speed Network Performance of a High-Level Parallel Language on a High-Speed Network Henri Bal Raoul Bhoedjang Rutger Hofman Ceriel Jacobs Koen Langendoen Tim Rühl Kees Verstoep Dept. of Mathematics and Computer Science

More information

Optimizing TCP Receive Performance

Optimizing TCP Receive Performance Optimizing TCP Receive Performance Aravind Menon and Willy Zwaenepoel School of Computer and Communication Sciences EPFL Abstract The performance of receive side TCP processing has traditionally been dominated

More information

The influence of system calls and interrupts on the performances of a PC cluster using a Remote DMA communication primitive

The influence of system calls and interrupts on the performances of a PC cluster using a Remote DMA communication primitive The influence of system calls and interrupts on the performances of a PC cluster using a Remote DMA communication primitive Olivier Glück Jean-Luc Lamotte Alain Greiner Univ. Paris 6, France http://mpc.lip6.fr

More information

LAPI on HPS Evaluating Federation

LAPI on HPS Evaluating Federation LAPI on HPS Evaluating Federation Adrian Jackson August 23, 2004 Abstract LAPI is an IBM-specific communication library that performs single-sided operation. This library was well profiled on Phase 1 of

More information

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks Ryan G. Lane Daniels Scott Xin Yuan Department of Computer Science Florida State University Tallahassee, FL 32306 {ryanlane,sdaniels,xyuan}@cs.fsu.edu

More information

Using Switched Ethernet for Hard Real-Time Communication

Using Switched Ethernet for Hard Real-Time Communication International Conference on Parallel Computing in Electrical Engineering (PARELEC 2004), Sept. 2004, Dresden, Germany 1 Using Switched Ethernet for Hard Real-Time Communication Jork Loeser TU Dresden,

More information

Introduction to Operating Systems. Chapter Chapter

Introduction to Operating Systems. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

Recently, symmetric multiprocessor systems have become

Recently, symmetric multiprocessor systems have become Global Broadcast Argy Krikelis Aspex Microsystems Ltd. Brunel University Uxbridge, Middlesex, UK argy.krikelis@aspex.co.uk COMPaS: a PC-based SMP cluster Mitsuhisa Sato, Real World Computing Partnership,

More information

Measurement-based Analysis of TCP/IP Processing Requirements

Measurement-based Analysis of TCP/IP Processing Requirements Measurement-based Analysis of TCP/IP Processing Requirements Srihari Makineni Ravi Iyer Communications Technology Lab Intel Corporation {srihari.makineni, ravishankar.iyer}@intel.com Abstract With the

More information

Chapter 7: Main Memory. Operating System Concepts Essentials 8 th Edition

Chapter 7: Main Memory. Operating System Concepts Essentials 8 th Edition Chapter 7: Main Memory Operating System Concepts Essentials 8 th Edition Silberschatz, Galvin and Gagne 2011 Chapter 7: Memory Management Background Swapping Contiguous Memory Allocation Paging Structure

More information

VIA2SISCI A new library that provides the VIA semantics for SCI connected clusters

VIA2SISCI A new library that provides the VIA semantics for SCI connected clusters VIA2SISCI A new library that provides the VIA semantics for SCI connected clusters Torsten Mehlan, Wolfgang Rehm {tome,rehm}@cs.tu-chemnitz.de Chemnitz University of Technology Faculty of Computer Science

More information

Non-blocking Java Communications Support on Clusters

Non-blocking Java Communications Support on Clusters Non-blocking Java Communications Support on Clusters Guillermo L. Taboada*, Juan Touriño, Ramón Doallo UNIVERSIDADE DA CORUÑA SPAIN {taboada,juan,doallo}@udc.es 13th European PVM/MPI Users s Meeting (EuroPVM/MPI

More information

Research on the Implementation of MPI on Multicore Architectures

Research on the Implementation of MPI on Multicore Architectures Research on the Implementation of MPI on Multicore Architectures Pengqi Cheng Department of Computer Science & Technology, Tshinghua University, Beijing, China chengpq@gmail.com Yan Gu Department of Computer

More information

Evaluating Personal High Performance Computing with PVM on Windows and LINUX Environments

Evaluating Personal High Performance Computing with PVM on Windows and LINUX Environments Evaluating Personal High Performance Computing with PVM on Windows and LINUX Environments Paulo S. Souza * Luciano J. Senger ** Marcos J. Santana ** Regina C. Santana ** e-mails: {pssouza, ljsenger, mjs,

More information

Initial Performance Evaluation of the Cray SeaStar Interconnect

Initial Performance Evaluation of the Cray SeaStar Interconnect Initial Performance Evaluation of the Cray SeaStar Interconnect Ron Brightwell Kevin Pedretti Keith Underwood Sandia National Laboratories Scalable Computing Systems Department 13 th IEEE Symposium on

More information

Network Design Considerations for Grid Computing

Network Design Considerations for Grid Computing Network Design Considerations for Grid Computing Engineering Systems How Bandwidth, Latency, and Packet Size Impact Grid Job Performance by Erik Burrows, Engineering Systems Analyst, Principal, Broadcom

More information

Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet

Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet Pilar González-Férez and Angelos Bilas 31 th International Conference on Massive Storage Systems

More information

Performance Evaluation of InfiniBand with PCI Express

Performance Evaluation of InfiniBand with PCI Express Performance Evaluation of InfiniBand with PCI Express Jiuxing Liu Server Technology Group IBM T. J. Watson Research Center Yorktown Heights, NY 1598 jl@us.ibm.com Amith Mamidala, Abhinav Vishnu, and Dhabaleswar

More information

Chapter 8: Memory-Management Strategies

Chapter 8: Memory-Management Strategies Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and

More information

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

IX: A Protected Dataplane Operating System for High Throughput and Low Latency IX: A Protected Dataplane Operating System for High Throughput and Low Latency Belay, A. et al. Proc. of the 11th USENIX Symp. on OSDI, pp. 49-65, 2014. Reviewed by Chun-Yu and Xinghao Li Summary In this

More information