SPLINTERING TCP TO DECREASE SMALL MESSAGE LATENCY IN HIGH-PERFORMANCE COMPUTING

by

BREANNE DUNCAN

THESIS

Submitted in Partial Fulfillment of the Requirements for the Degree of Bachelor of Science, Computer Science

The University of New Mexico
Albuquerque, New Mexico

May, 2003

© 2003, Breanne Duncan

Dedication

For my mother, whom I helped copy BASIC code out of magazines onto our Apple IIc when I was in kindergarten. Without her inspiration, my love for computing may never have been born.

Acknowledgments

Foremost, I would like to thank my advisor, Prof. Barney Maccabe, for introducing me to the field of computer systems and networks and allowing me the opportunity of doing undergraduate research. I owe many thanks to Patricia Crowley, my mentor, for dedicating much time to my success on this project and guiding me through my first research experience. Wenbin Zhu was also of much help in desperate times, as was Edgar León.

SPLINTERING TCP TO DECREASE SMALL MESSAGE LATENCY IN HIGH-PERFORMANCE COMPUTING

by

BREANNE DUNCAN

ABSTRACT OF THESIS

Submitted in Partial Fulfillment of the Requirements for the Degree of Bachelor of Science, Computer Science

The University of New Mexico
Albuquerque, New Mexico

May, 2003

SPLINTERING TCP TO DECREASE SMALL MESSAGE LATENCY IN HIGH-PERFORMANCE COMPUTING

by

BREANNE DUNCAN

B.S., Computer Science, University of New Mexico, 2003

Abstract

The adoption of commodity networking protocols and hardware in high-performance computing yields low-cost alternatives to expensive proprietary communications infrastructures. However, commodity protocols do not inherently have the advantages of low overhead, low latency, and offloaded control programs associated with specialty components, such as Myrinet. Many high-performance computing applications depend on the rapid transmission of small messages between processors. Hosts in a high-performance computing environment using gigabit Ethernet and TCP/IP need mechanisms to acknowledge small messages quickly without engaging the operating system. By splintering TCP and transferring acknowledgment capabilities to the network interface card (NIC), an application avoids costly interrupts and context switches into the operating system that incur delay. Acknowledgments are sent earlier, allowing the sender to advance its TCP flow control window more frequently. Results from a proof-of-concept model suggest that small message acknowledgment latency can be reduced by approximately 40% through splintering the TCP stack.

Contents

1 Introduction
2 Motivations
  2.1 Host Bottlenecks
  2.2 Transmission Latency
  2.3 TCP Performance Issues
3 Modifications for High-Performance Computing
  3.1 Protocol Offloading
  3.2 OS Bypass
    3.2.1 Zero-Copy Data Movement
    3.2.2 Advantages
    3.2.3 Disadvantages
    3.2.4 Partial OS Bypass
  3.3 Splintering
4 Splintering TCP Acknowledgments to Decrease Latency
  4.1 Error Checking
  4.2 Assumptions
  4.3 Implementation
  4.4 Performance Testing
5 Results
6 Related Work
  6.1 MPICH Application-Bypass
  6.2 Offloading RMPP RTS/CTS
  6.3 Trapeze: Optimizations for TCP/IP over Gigabit Networks
  6.4 Offloading IP Fragmentation and Reassembly
7 Conclusions
References

Chapter 1. Introduction

High-performance computing is typically characterized by specially designed communications hardware and protocols. These features support maximum utilization of node processing power and provide high communication bandwidth between nodes. The low-latency transmission that many scientific applications depend on is inherent in such networks. While this approach may provide the best technology available at any given time, the high expense of implementing such a system is a deterrent in many circumstances. Typically only large institutions have access to specialized, proprietary hardware and protocols.

A rapidly advancing alternative involves the use of long-established networking protocols running on commodity hardware. The advantages of a commodity approach are numerous. The cost of hardware is an obvious comparison: commodity network cards, switches, and cabling are vastly less expensive than their proprietary counterparts. As of March 2003, gigabit Ethernet cards could be purchased for as low as $30[6], while Myrinet cards were around $1000[13]. Those who previously could not afford high-performance computing may soon be able to due to the low cost of commodity components.

Cost is not the only advantage of using commodity components. Commodity products are readily available in abundance, and many people are skilled in their use. Modification and maintenance of network stack and programmable network interface card (NIC) code is simplified through this ubiquitous knowledge, which speeds research and development in the context of specialization for high-performance clusters. Commodity hardware and protocols are also highly interoperable. This interoperability extends horizontally between host nodes in a network and vertically between network architecture layers.

Commodity components are obviously not tailored to a high-performance computing environment. When commodity components are used without modification, one cannot expect satisfactory performance. It is necessary to change the functionality of networking hardware and to tailor commodity protocols to a high-performance environment. In theory, this approach yields the advantages of commodity components and the power of specialty components.

Many high-performance computing applications rely on rapid communication of small messages between compute nodes. Because large quantities of such messages are sent, it is important that message latency be as low as possible. Decreased latency yields an increase in computational efficiency: the processor spends less time waiting to receive data before continuing computation. However, low latency is not inherent in commodity protocols and networks. Modifications must be made to protocol stacks and to NIC and application interaction to ensure low latency while maintaining high bandwidth. The network interface card must take on responsibilities normally delegated to the host operating system (OS). This reduces operating system overhead and decreases message latency by maintaining a more direct link between the network and the process.

High-performance applications and libraries usually differentiate between small and large messages and handle them very differently. For example, Sun MPI treats packets under 1KB as short messages, while larger packets are treated as long messages. Access permissions and memory locations for stored payloads are different for each message type. In this work, a small message refers to a packet of less than 1500 bytes, the maximum payload of a standard Ethernet frame.

This work discusses latency reduction for small message traffic. Chapter 2 is devoted to the problems of host bottlenecks and transmission latency. Chapter 3 discusses modifications to tailor commodity protocols to high-performance computing. Chapter 4 looks at splintering TCP acknowledgments, a method proposed to decrease transmission latency in a high-performance infrastructure built on commodity components. Chapter 5 explores results from testing this model. Chapter 6 discusses related work. The final chapter presents conclusions about the efficacy of splintering TCP acknowledgments.

Chapter 2. Motivations

Commodity components and protocols do not receive and process packets efficiently in a high-performance environment. The overhead incurred through processing a received packet in the operating system and copying data to the application increases latency dramatically. TCP and gigabit Ethernet packet handling must evolve to accommodate high-performance computing needs. The path between the network interface card and the application must minimize latency without putting undue strain on the NIC or host processor.

2.1 Host Bottlenecks

High-speed networks, such as 10Gbps Ethernet, now allow rapid point-to-point communication. However, we are overlooking bottlenecks internal to the host that seriously diminish communications efficiency. Culler et al. identify communication delay, overhead, and gap as key parameters in defining parallel computing performance[3]. High-performance applications are sensitive to these metrics[12]. Low latency and overhead are necessary for efficient computation and communication[14]. An increase in any parameter may decrease overall performance.

Culler et al. define latency as a metric pertaining to NIC-to-NIC transmissions. Because my work deals with application-to-application message transfer, I define latency as NIC-to-NIC transmission time plus the time elapsed during communications processing at a host system. This gives a more realistic assessment of the time it takes data to travel between processes in a high-performance computing infrastructure.

Although bandwidth and processor speeds continually increase, overhead and high application-to-application message latency remain a problem. A high-speed line sends small messages so quickly that it creates a massive amount of interrupt pressure within the kernel space of the receiving host[11, 8]. Furthermore, the kernel must deal with processing packets and moving message payloads into user space after receipt and error checking. High communications overhead can overwhelm the processor and prevent it from spending valuable time on computation. This processing time also prevents messages from being delivered to the application quickly. Due to these blockades, applications cannot harness the bandwidth and speed the network provides[7, 3].

2.2 Transmission Latency

Many high-performance, scientific computing applications depend on rapid, low-latency transmission of messages between processors. High message latency leads to CPU idling and wasted resources. The application may wait for message arrival before continuing computation[3]. Ensuring consistently low message latency is key for computations linearly sensitive to communications delay, such as fast Fourier transforms, and for jitter-sensitive, real-time applications.

Processes frequently sending small messages between nodes suffer performance loss in a high-latency environment. Because much of their focus is on communication rather than strictly computation, inefficiencies in the network infrastructure or host message processing system negatively affect computational efficiency as well. Time spent waiting to receive messages is time wasted in terms of computation. Serially processing large numbers of small messages can in turn overwhelm the CPU and monopolize computing resources. A high-latency network prevents scalability as well: adding more nodes to a network necessitates more communication and thus presents an additive increase in latency.

Many scientific applications rely on small message transmission to gather values computed on other nodes for local computation. The fast Fourier transform is frequently used in scientific, high-performance computing. This operation is very sensitive to latency due to the small data messages passed between processes[9]. SMG2000, LU, and EM3D are examples of prominent latency-sensitive scientific benchmark applications. SMG2000, a supercomputing benchmark, is a parallel semicoarsening multigrid solver for the linear systems of a particular diffusion equation. Nodes running SMG2000 often need current approximate solutions from other processes before performing computations locally[10]. This makes the computationally intensive program a communication-intensive program as well. LU is a NASA Advanced Supercomputing Parallel Benchmark (NPB 2) program that solves a finite difference discretization of the 3-D compressible Navier-Stokes equations. At each iteration of the algorithm, many small messages are sent between processes[4]. EM3D is the kernel of an application that models propagation of electromagnetic waves through three-dimensional objects. It is linearly sensitive to latency, and the number of messages sent per processor is extremely high ( msg/proc/ms)[12].

2.3 TCP Performance Issues

Reliability and connection costs make TCP a poor choice for high-performance computing. Flow control, congestion control, and error checking increase message latency. Connection maintenance presents scalability issues. For efficient communication, there must exist a connection between each pair of nodes; in an n-node system, this incurs an n² cost for connection and disconnection. A three-way handshake between each pair of nodes adds significant time and complexity to system startup. Closing connections presents a similar problem.

Chapter 3. Modifications for High-Performance Computing

Many modifications of TCP/IP that increase protocol viability in high-performance computing have been suggested. The majority use protocol offloading to distribute network stack functionality over available hardware. Interrupt coalescing, offloaded packet fragmentation and reassembly, zero-copy data movement, and offloaded error checking have been used to increase commodity protocol performance. These techniques help increase effective processor utilization and communication bandwidth. They can increase application performance by reducing message latency and communication overhead for host processors. These optimizations alleviate problems that previously restricted commodity protocols and hardware from achieving peak performance.

3.1 Protocol Offloading

Offloading parts of the networking protocol stack naturally transfers some of the work load from the CPU to the NIC processor. The TCP and IP protocols, in whole or in part, can be moved to the network interface card to take strain off the host processor. A total offload strategy involves moving all protocol functionality onto the NIC. Alternatively, a protocol can be partially offloaded: it is broken up into separate pieces and spread across available resources. Research efforts have involved offloading parts of the IP protocol, such as error checking and packet fragmentation and reassembly.

3.2 OS Bypass

Offloading can be used as a tool to allow communications tasks to bypass the operating system. Offloaded control allows the NIC to perform tasks normally delegated to the OS. After an application initially acquires a resource through the OS, the application is free to communicate with the resource directly without operating system intervention. In terms of latency and overhead, this strategy presents a one-time cost, rather than a per-packet cost.

3.2.1 Zero-Copy Data Movement

Using OS bypass, the NIC and application have a direct communication link. A zero-copy strategy can be used to move data directly from the NIC to user space. An application expecting to receive data can send buffer descriptors to the NIC, informing it where to place data. The NIC places packet payloads in the appropriate location in user space. For fragmented packets, this requires the network interface card to perform message reassembly.

Traditional, non-zero-copy mechanisms entail the network interface card sending packets to the OS, rather than directly to the application. The operating system processes a packet before sending the payload to the application. This involves performing any necessary error checking, packet reassembly, or adjustment of reliability parameters, such as advancing the sender's TCP flow control window.
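As a concrete illustration of the descriptor posting described above, the following C sketch shows how an application might hand the NIC a receive buffer descriptor. The descriptor layout and the driver entry point nic_post_recv_desc are illustrative assumptions only; no particular NIC interface is implied, and the driver call is stubbed so the sketch is self-contained.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical descriptor format: tells the NIC where payloads may go. */
    struct recv_descriptor {
        uint64_t user_addr;   /* address of a pinned, user-space buffer */
        uint32_t length;      /* bytes available at that address */
        uint16_t local_port;  /* connection this buffer is reserved for */
    };

    /* Stub for a driver call that would pass the descriptor to NIC firmware. */
    static int nic_post_recv_desc(const struct recv_descriptor *d)
    {
        printf("posted %u-byte buffer for port %u\n",
               (unsigned)d->length, (unsigned)d->local_port);
        return 0;
    }

    static char payload_buf[1500];    /* one MTU-sized receive buffer */

    int main(void)
    {
        struct recv_descriptor d = {
            .user_addr  = (uint64_t)(uintptr_t)payload_buf,
            .length     = sizeof payload_buf,
            .local_port = 5000,
        };
        /* After this call the NIC can DMA matching payloads directly into
           payload_buf, bypassing the kernel's socket buffers. */
        return nic_post_recv_desc(&d);
    }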

3.2.2 Advantages

OS bypass lowers message latency and CPU overhead. The zero-copy infrastructure prevents CPU and memory bus involvement in transferring received data into user space. The CPU is free to devote itself strictly to computation, rather than message processing and payload movement.

In the context of small messages, OS bypass helps messages move quickly from the NIC to the application without going through a middleman. Because the amount of data in each message is small, it is latency and message processing, not data processing, that presents a problem. Each small message that arrives must interrupt the operating system, incurring a per-packet cost in terms of latency and overhead. This is not as problematic with large packets that arrive less frequently[5]. Although network bandwidth may be at an acceptable rate, NIC-to-application packet transmission may incur unacceptable costs going through the OS. Network performance is limited not by the hardware, but instead by host bottlenecks[2].

3.2.3 Disadvantages

In a total OS bypass model, the NIC controls communications entirely after resource acquisition. All protocol processing is handled by the NIC. Ethernet Message Passing (EMP)[14] is one implementation of this strategy. EMP delegates virtual memory management to the application, while descriptors are handled by both the application and the NIC. The operating system does not intervene in communications, and zero-copy is used for data transfer.

Moving too much functionality to the NIC does not optimize resource use or performance. NIC processing power is dwarfed by that of a host system. For example, current Alteon ACENIC gigabit Ethernet cards have only two 88MHz processors and 2MB of local RAM. Intense involvement in communications processing can overwhelm NIC resources. While EMP shows an improvement over a traditional setup in terms of small message latency and bandwidth[14], other methods that use partial protocol offloading have been more successful[5].

3.2.4 Partial OS Bypass

Partial OS bypass involves offloading some, but not all, tasks to the NIC to take advantage of NIC and CPU resources at an optimal level. Under partial OS bypass, the operating system is still involved in communications tasks. However, it is interrupted much less frequently and spends fewer computation cycles dealing with communication. It also does not force applications to use a particular protocol, as total OS bypass does when offloading the entire protocol stack[8].

3.3 Splintering

Splintering combines partial offloading and partial OS bypass methods to optimize communication between the network and the application. The kernel protocol stack is splintered, or broken apart, and tasks are offloaded to both the NIC and the application. The operating system maintains control of resources, dealing with connection management, TCP reliability metrics, memory management, and scheduling. Pushing these concerns onto the network interface card would force the NIC to understand global resource management and resource protection. While it is possible to implement this, it would complicate NIC operation immensely and put strain on its limited processing power.

Under splintering, the application notifies the network interface card of expected packets. It sends buffer descriptors to the NIC, giving the location at which to place packet payloads. The NIC moves messages directly to the application without intermediate buffer copies. This zero-copy strategy reduces delay for message transfer from the NIC to the application. After data transfer, the NIC may send message headers to the OS, or coalesce them to send at a later time. Sending headers to the operating system ensures that the OS has current communications information. Figure 3.1 shows the splintered TCP architecture.

Figure 3.1: The splintered TCP architecture. (The application posts receive descriptors to the NIC; the NIC delivers data packets directly to the application's receive buffer and passes headers to the operating system, which the application reaches through the socket read interface.)

I propose to splinter the TCP stack by offloading TCP acknowledgment (ACK) capabilities to the NIC. Sending ACKs at the NIC level, instead of in the kernel, decreases operating system involvement in message receipt. Application-to-application communication latency decreases and effective processor utilization increases.
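The receive path of Figure 3.1 can be sketched in code as follows. All types and helper names here (match_descriptor, dma_to_os, queue_header_for_os) are hypothetical stand-ins for firmware internals; the sketch only fixes the order of operations described above, with memcpy standing in for the zero-copy DMA transfer.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    struct packet          { uint16_t dst_port; const char *payload; size_t len; };
    struct recv_descriptor { uint16_t port; char *user_addr; };

    /* One posted descriptor; a real NIC would keep a table per connection. */
    static struct recv_descriptor posted;

    static struct recv_descriptor *match_descriptor(const struct packet *p)
    {
        return (posted.user_addr && p->dst_port == posted.port) ? &posted : NULL;
    }

    static void dma_to_os(const struct packet *p)          { (void)p; /* kernel path */ }
    static void queue_header_for_os(const struct packet *p){ (void)p; /* coalesced  */ }

    static void on_packet_arrival(const struct packet *p)
    {
        struct recv_descriptor *d = match_descriptor(p);
        if (d == NULL) {
            dma_to_os(p);               /* unknown flow: normal kernel path */
            return;
        }
        memcpy(d->user_addr, p->payload, p->len);  /* models zero-copy DMA */
        queue_header_for_os(p);         /* OS still sees connection state */
    }

    int main(void)
    {
        static char buf[1500];
        posted = (struct recv_descriptor){ .port = 5000, .user_addr = buf };
        struct packet p = { .dst_port = 5000, .payload = "hello", .len = 5 };
        on_packet_arrival(&p);
        return 0;
    }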

Chapter 4. Splintering TCP Acknowledgments to Decrease Latency

Offloading TCP acknowledgment functionality allows the network interface card to acknowledge packets as soon as they arrive. Normally, acknowledgments are made at the kernel level after packet processing takes place and data is sent to the application. The OS checks the payload in user space for errors before sending an ACK to the originating machine. This sequence of events increases acknowledgment latency substantially.

Splintering TCP acknowledgments reduces application-to-application communication delay. Acknowledging packets at the NIC allows ACK messages to be received earlier at a peer node. In the context of small messages arriving frequently, low-latency ACKs are vital to keep nodes synchronized efficiently. Upon acknowledgment arrival, the OS can adjust the connection's TCP flow control window. Thus, the window is advanced more frequently and more data is sent. With enough optimization, the network becomes bandwidth-limited rather than latency-limited in terms of application-to-application communications speed. Figure 4.1 contrasts this strategy with the common TCP acknowledgment method.

Figure 4.1: Normal (left) and splintered (right) TCP acknowledgment processes.
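A minimal sketch of the NIC-side ACK construction follows. It assumes the standard TCP header layout, but everything else (the function name, the handling of the advertised window, the omitted checksum, the assumption that MAC and IP address swapping happens elsewhere, as in the echo firmware of Chapter 4) is illustrative rather than an actual firmware interface.

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>   /* ntohl/htonl; firmware would use its own */

    struct tcp_hdr {
        uint16_t src_port, dst_port;
        uint32_t seq, ack;
        uint8_t  data_off;   /* header length in 32-bit words, upper nibble */
        uint8_t  flags;      /* 0x10 = ACK */
        uint16_t window, checksum, urgent;
    };

    /* Build an immediate ACK for a received data segment, before the payload
       has been checksummed or delivered to the application. */
    void build_immediate_ack(const struct tcp_hdr *in, uint32_t payload_len,
                             struct tcp_hdr *out)
    {
        memset(out, 0, sizeof *out);
        out->src_port = in->dst_port;   /* reverse the flow */
        out->dst_port = in->src_port;
        out->seq      = in->ack;        /* valid when the ACK flag was set */
        out->ack      = htonl(ntohl(in->seq) + payload_len); /* cover segment */
        out->data_off = (uint8_t)((sizeof *out / 4) << 4);
        out->flags    = 0x10;           /* ACK only */
        out->window   = in->window;     /* sketch: echo the advertised window */
        /* The TCP checksum over the pseudo-header is omitted in this sketch. */
    }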

This optimization is highly interoperable between compute nodes. Splintering TCP acknowledgments on one node is transparent to all other nodes. The TCP and IP protocols do not appear modified from outside the given node. Thus, splintering is not required for all nodes in a system. Communications performance can be improved even if only one host in a pair of connected nodes has a splintered TCP stack.

Splintering TCP acknowledgments is a receive-side optimization. The sender must still handle incoming acknowledgments explicitly in the operating system, because the NIC does not handle TCP control messages.

4.1 Error Checking

TCP error checking is splintered and delegated to the application. An application can checksum while reading data from the receive buffer. Alternately, the application can choose to avoid error checking entirely if desired. This strategy is useful for real-time applications in which checksumming adds unacceptable delay. Moving error checking to the application level reduces TCP to a protocol that is no longer reliable between processes. Applications that need this assurance must provide other reliability mechanisms.
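As a sketch of checksumming at the application level, the routine below computes a standard RFC 1071-style Internet checksum over a received payload. Verifying a full TCP checksum would additionally require the pseudo-header fields, which are omitted here.

    #include <stdint.h>
    #include <stddef.h>

    /* Ones'-complement Internet checksum (RFC 1071) over len bytes. */
    uint16_t inet_checksum(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t sum = 0;

        while (len > 1) {                      /* sum 16-bit words */
            sum += ((uint32_t)p[0] << 8) | p[1];
            p += 2;
            len -= 2;
        }
        if (len == 1)                          /* pad an odd trailing byte */
            sum += (uint32_t)p[0] << 8;
        while (sum >> 16)                      /* fold carries back in */
            sum = (sum & 0xffff) + (sum >> 16);

        return (uint16_t)~sum;
    }

An application could fold this into the loop that drains its receive buffer and, on mismatch, fall back to a retransmission request made outside TCP, as discussed below.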

Removing error checking from the kernel protocol stack removes the need for each packet to interrupt the OS. Also, the message does not need to be copied to kernel space. This decreases host interrupt pressure and increases CPU availability. The NIC avoids involvement in error checking as well. Research has shown that error checking is too computationally intensive for NIC processors[5]; offloading the task to the NIC decreases system performance. Moreover, delegating error handling and correction to the NIC is extremely inefficient. The NIC would be forced to handle a large array of conditions that occur infrequently and require much processing.

Acknowledging packets at the NIC diminishes the reliability of TCP because ACKs are sent before message data is confirmed error-free. Thus the TCP protocol is modified and no longer adheres to its original node-to-node error-free transmission guarantee. However, this method reduces message latency, and errors may be handled through other mechanisms.

4.2 Assumptions

Following Amdahl's law, splintering focuses on a performance increase for the common case. For latency-sensitive, high-performance applications, this entails offloading operations that allow small data messages to arrive at the application sooner. Small messages represent the bulk of traffic for such applications.

Packet corruption on the PCI bus is handled by the application. Out-of-order packets are handled by the OS. This strategy relies on the fact that the vast majority of packets arrive error-free and in correct order. Optimization of these infrequent events would not yield a large performance gain and would add undue complexity to the model. Splintering focuses on improving common-case performance.

TCP control message arrival is not a common case, and control messages are therefore handled by the operating system.
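The NIC-side test that separates control traffic from the data fast path can be as small as a flag check, sketched below. The helper names in the dispatch comment are hypothetical, but the flag values are the standard TCP header bits.

    #include <stdint.h>
    #include <stdbool.h>

    #define TCP_FIN 0x01u
    #define TCP_SYN 0x02u
    #define TCP_RST 0x04u

    /* True if a segment carries connection-control flags and therefore
       must be passed to the operating system rather than ACKed on the NIC. */
    bool is_control_segment(uint8_t tcp_flags)
    {
        return (tcp_flags & (TCP_SYN | TCP_FIN | TCP_RST)) != 0;
    }

    /* Sketch of the dispatch in the firmware receive routine:

           if (is_control_segment(flags) || !for_optimized_app(hdr))
               dma_to_os(pkt);        // normal kernel path
           else
               fast_path(pkt);        // splintered receive + immediate ACK
    */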

The NIC sends connection and disconnection requests and other connection management messages directly to the OS. The network interface card should handle TCP connection and disconnection requests efficiently. Checking each message to see whether it deals with connection control does not add an unacceptable amount of overhead and latency to message receipt and processing. Message headers are already checked to identify the connection to which an arriving packet belongs, and if a packet does not pertain to the high-performance application being optimized, it is passed to the operating system. Thus checking control bits in the message header to determine message type does not add delay.

The reliability of wired networks suggests that packets are rarely, if ever, corrupted in transport. If a message is corrupted on the wire between nodes, the NIC is likely to detect the error while performing the cyclic redundancy check (CRC). The packet is then dropped before an acknowledgment is sent. This provides the standard TCP error control mechanism: if no ACK is received, the packet is retransmitted. An ACK is sent in error only if the packet payload is corrupted on the PCI bus while traveling from the NIC to user space, which rarely occurs. If an application needs complete end-to-end error checking, it can perform TCP checksumming on received payloads. If an error is detected, another protocol must be used to request retransmission. In splintering TCP acknowledgments, end-to-end reliability is sacrificed for latency reduction.

4.3 Implementation

My research involves building a proof-of-concept model showing reduced latency for acknowledgments made at the NIC level. To do this, I measured the latency incurred as a small (64-byte) message passes from the NIC to an application. Comparative analysis demonstrates the value of sending ACKs at a lower level of the network infrastructure.

My model consists of one host acting as an echo client that sends small packets to a second host. The second host echoes back these packets at the NIC and application levels. All messages in a batch are echoed from only one level, to prevent an additive increase in latency from the echo operation. Echo server functionality was added to the programmable NIC's firmware receive routine. A simple program acting as an echo server sends back packets at the application level. When echoing at the NIC, received messages are not passed up to higher layers. The machine acting as an echo client has unmodified NIC firmware and an unmodified network stack. Figure 4.2 shows echo locations.

Figure 4.2: Echo server model.

The NIC echo process consists of simply swapping source and destination addresses in the packet header. Ethernet MAC and IP addresses undergo this swap, and source and destination port numbers in the transport layer header are switched as well. After these changes are made, the echo packet is enqueued in the NIC MAC engine to be sent. The DMA engine is not activated to send the packet to the OS as it normally would be. Application-level echoing does not involve explicitly modifying packet contents. The echo program receives a payload through a socket and then calls a user-level UDP send function to transmit the payload to the originating machine. Pseudocode for measurements is shown in Figure 4.3. The 10-byte buffers represent packet payloads that, when combined with Ethernet, IP, and UDP headers, yield a 64-byte packet.

    process echo_client:
        char buffer[10];
        time_t latency[n];
        time_t t1, t2;
        for (int i = 0; i < n; i++) {
            t1 = gettime();
            send(buffer);
            receive(buffer);
            t2 = gettime();
            latency[i] = t2 - t1;
        }

    process NIC_echo:
        char buf[RECV_MSG_SIZE];
        byte tmp48[6], tmp32[4], tmp16[2];
        /* swap source and destination Ethernet MAC addresses */
        tmp48 = *(buf + src_MAC_addr);
        *(buf + src_MAC_addr) = *(buf + dst_MAC_addr);
        *(buf + dst_MAC_addr) = tmp48;
        /* swap source and destination IP addresses */
        tmp32 = *(buf + src_IP_addr);
        *(buf + src_IP_addr) = *(buf + dst_IP_addr);
        *(buf + dst_IP_addr) = tmp32;
        /* swap source and destination transport-layer ports */
        tmp16 = *(buf + src_port);
        *(buf + src_port) = *(buf + dst_port);
        *(buf + dst_port) = tmp16;
        set_frame_length();
        set_DMA_read_location();
        enqueue_MAC_transmission(&buf);

    process APP_echo:
        char buffer[10];
        while (1) {
            receive(buffer);
            send(buffer);
        }

Figure 4.3: Pseudocode for echo testing.

This process models sending TCP acknowledgments after the message has traveled to the given layer. Acknowledgments are sent directly from the NIC upon message receipt. At the application level, ACKs are sent from the TCP stack in the kernel as per normal protocol operation; however, the message must be received by the application before this occurs. Thus echoing messages at the application includes the latency incurred from the OS passing packet data to the application before it sends an ACK. In principle, sending an echo packet at each layer mimics sending an ACK after a packet or payload arrives at the given layer.

Because this is a proof-of-concept model and not a simulation, several issues are ignored in implementation. A zero-copy mechanism is not implemented. The echo server application does not send buffer descriptors to the network interface card to notify it of impending message receipt, and no pages are pinned for the NIC to transfer packet payloads to user space. When echoing at the NIC, message headers are not sent to the OS because packets do not travel past the NIC level. In an application-level echo, the echo server does not perform error checking; however, error-checking costs are included because checksumming is performed in the OS before the packet is passed up. Connection multiplexing is also ignored in this model.

4.4 Performance Testing

Latency measurements for an implementation of the proof-of-concept model were gathered using ping-pong testing through the echo model described in Section 4.3. Data was gathered at the NIC and application levels.

I ran the experiments using two Dell Precision WorkStation 620 MT machines with 933MHz Pentium III processors. Each machine has 256MB of RAM, a 64-bit, 66MHz PCI bus, and an Alteon ACENIC gigabit Ethernet adapter with a Tigon II chipset. The machines' network interfaces were directly connected; no switches or routers were present as intermediaries. Both systems run the Linux kernel and Red Hat 7.0. Each machine had a warmed cache and a static ARP table populated with the peer machine's data. Interrupt coalescing was turned off during testing. The echo server and client applications were the only explicit user processes running on the machines.

Because this splintering model focuses on optimizing data messages rather than control messages, UDP packets can be used for testing even though the focus of this work is TCP. Between connection and disconnection, TCP data packets represent the majority of messages being sent. UDP and TCP data packets are nearly the same in format. Thus message latencies would be almost identical if TCP packets were used for testing and connection issues ignored.

A simple user-level program acted as an echo client on the machine with unmodified protocol stacks and NIC firmware. UDP packets were sent through a datagram socket. These 64-byte messages traveled through the normal OS network stack to the echoing machine.

Latency was measured as the time elapsed between sending a message and receiving an echoed copy at the application. Iterations of 100, 1000, and 5000 messages were sent sequentially to get accurate latency measurements and to observe changes in latency across time and message cluster size.

Latency measurements from this testing model should be very similar to those in a real implementation of the splintered TCP strategy. Splintering TCP acknowledgments seeks to remove error checking and flow control overhead from the receive-side latency path. UDP does not involve checksumming* or any reliability mechanisms. Thus this model simulates TCP acknowledgments being made at the NIC before any of these operations would take place, because the network interface card simply echoes the message at the same point that an ACK would be sent. However, echoing at the application level still involved IP error checking in these tests. Under the splintering strategy, wherein error checking and flow control window advancement are done after the ACK, reported latency measurements for this layer are slightly high: with splintering, IP checksums would take place after the echo. If instead the normal model is followed and reliability operations take place first, reported latencies are lower than real latencies, because TCP checksum and connection control operations would be performed before the echo is sent. Also, measurements may be slightly lower on other gigabit NICs; the Alteon NIC used for testing is known to have a slow DMA engine.

* UDP checksumming is optional. It is not used in this testing.
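For reference, a minimal reconstruction of such a ping-pong echo client appears below. It is not the original test program: the peer address and port are placeholders, and it uses clock_gettime where the pseudocode of Figure 4.3 says gettime.

    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(void)
    {
        enum { N = 1000, PAYLOAD = 10 };
        char buf[PAYLOAD] = {0};
        struct sockaddr_in srv = {0};

        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        srv.sin_family = AF_INET;
        srv.sin_port   = htons(7);                          /* placeholder port */
        inet_pton(AF_INET, "192.168.1.2", &srv.sin_addr);   /* placeholder peer */

        for (int i = 0; i < N; i++) {
            struct timespec t1, t2;
            clock_gettime(CLOCK_MONOTONIC, &t1);
            sendto(fd, buf, sizeof buf, 0, (struct sockaddr *)&srv, sizeof srv);
            recv(fd, buf, sizeof buf, 0);                   /* wait for echo */
            clock_gettime(CLOCK_MONOTONIC, &t2);
            long usec = (t2.tv_sec - t1.tv_sec) * 1000000L +
                        (t2.tv_nsec - t1.tv_nsec) / 1000L;
            printf("%d %ld\n", i, usec);                    /* msg no., latency */
        }
        close(fd);
        return 0;
    }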

Chapter 5. Results

Tests showed a marked latency reduction for small messages echoed at the NIC level rather than at the OS or application level. A 1000-packet run gave an average of 195 µs for application echoes and 117 µs for NIC echoes. Echoing at the NIC decreased latency by roughly 40%. The standard deviation for NIC measurements was 3 µs; for application measurements, 54 µs. Figure 5.1 shows ping-pong latency measurements for 64-byte packets.

The first packet echoed from both levels had higher-than-average latency. NIC startup costs may explain this delay. The 100-, 1000-, and 5000-packet iterations gave nearly identical latency measurements. Calculated average latencies were slightly different due to the degree of influence of the preliminary high-latency packets.

Figure 5.1 shows clusters of low-latency application echoes. Approximately one-third of these packets had round-trip times very close to the NIC latency measurements. This suggests that the operating system was running during this time interval and was able to accept incoming packets without a scheduling delay. Messages for the echo application interrupt the operating system, which is already running. The operating system schedules itself to run frequently because there is only one explicit user process running on the echo server system. Transferring a packet payload incurs a context switch into the application and a payload copy from kernel space to user space.

Figure 5.1: Latency measurements for NIC and application echo (64-byte packets).

Latency measurements suggest that this context switch is minimal and does not involve flushing cache or TLB entries. For these reasons, a high percentage of application echoes incur almost as little latency as NIC echoes. When six spinning processes were added to the system on which the application echo server ran, the number of packets at the low latency level dropped by over 50%.

The intricate relationship between the MAC and DMA engines in the Alteon NICs may help explain the minimal difference between the low application echo measurements and the NIC echo measurements. The Alteon NICs provide no strong mechanisms for controlling packet transmission. Packets are enqueued in the MAC engine, but other factors control when a packet is actually sent. It appears that the DMA engine must be active for enqueued packets to leave the NIC.

This posed a problem in the context of echoing packets at the NIC level without using DMA to transfer packet payloads to the OS. During application-level echoing, the DMA engine is constantly running. For NIC echoes, the DMA engine had to be coerced into allowing the MAC engine to put echo packets on the wire. This implies that the DMA engine was not active during the NIC echo process, and thus an engine startup cost of roughly 5 µs was incurred on each packet sent. The NIC measurements may therefore be slightly high in this implementation because of Alteon hardware issues.

Because the packet payload used in testing is so small (10 bytes), the time the OS spends copying this data into user space is negligible. Further testing with larger messages showed a larger gap between the low-level application echo latencies and the NIC echo latencies. Minimum latencies for application echoes went up 16 µs and 106 µs for 512-byte and 1KB messages, respectively. This suggests that, for small messages, a zero-copy mechanism is valuable chiefly for the movement from the NIC to the OS: possible scheduling and context switch costs incur high latency in this part of the message path, while the OS-to-application copy incurs almost no latency for small packets.

Using the higher level of application latency measurements as a baseline, tests show that echoing at the network interface card saves approximately 115 µs of delay. Amortized over the low and high application echo measurements reported here, latency decreases by roughly 78 µs. If the echo packet were a TCP acknowledgment message, a sender's TCP flow control window would be advanced tens of microseconds sooner. This latency reduction would occur the majority of the time on a loaded system because the operating system often is not running when a message arrives. On a system running a single process, this performance gain may not occur as frequently. In either case, splintering TCP acknowledgments is a viable method to reduce application-to-application communications delay.

Chapter 6. Related Work

6.1 MPICH Application-Bypass

Buntinas, Panda, and Brightwell present an application-bypass strategy to reduce message latency, communications overhead, and CPU idle time[1]. Their research focuses on high-performance applications that frequently use MPICH broadcast operations over a Myrinet network. They note that process skew, or lack of synchronization, can lead to compute nodes wasting processor cycles simply waiting for other nodes to complete tasks. In a tree broadcast model, where nodes receive a message and forward it to their children, this idling causes latency to cascade and increase as the message spreads. Additional latency is incurred when the process receiving the message performs other tasks before calling the MPICH broadcast function to forward the message.

In this application-bypass model, the MPICH library forwards a broadcast message as soon as it is received. By removing the application from this process and broadcasting at a lower level of the network stack, child processes receive messages faster: they do not wait on another application to make an explicit broadcast call. This application-bypass implementation decreases message latency even as processor skew increases. Results show up to a sixteen-fold improvement for MPICH broadcasts using application-bypass versus those that do not.

Both MPICH application-bypass and splintering avoid traveling through upper layers of the network stack to reduce communications delay and overhead. However, MPICH application-bypass deals strictly with MPI, whereas splintering TCP acknowledgments does not pertain to any particular message transmission interface and deals with a more ubiquitous protocol. In the context of process skew, splintering can be extended to provide mechanisms for higher-level libraries to place synchronization routines on the network interface card.

6.2 Offloading RMPP RTS/CTS

Maccabe et al. experiment with offloading parts of the Reliable Message Passing Protocol (RMPP) to programmable gigabit Ethernet cards[11]. In particular, they offload message fragmentation and reassembly, zero-copy mechanisms, and Request to Send (RTS) and Clear to Send (CTS) message processing.

Arriving data packets are normally copied into kernel space upon receipt. By offloading processing associated with message receipt and using a zero-copy strategy, communication overhead decreases. When the RMPP module sends a CTS packet to the requesting host, the RMPP library also sends the NIC the memory address at which it can place incoming packets. Offloading sender-side processing allows the NIC to respond to RMPP messages locally, rather than delegating this responsibility to the OS. When an initial RTS message is sent through the NIC, a message buffer descriptor is also sent, notifying the NIC of the memory location of the data to be transmitted. The NIC can then access this memory directly to construct and send packets, without going through the OS.

These send and receive optimizations allow more packets to be sent in a given period of time, increasing network bandwidth.

Experimenting with this implementation using Alteon ACENIC gigabit Ethernet cards yielded a bandwidth increase of over 50% and up to a 20% increase in CPU availability for messages of several hundred kilobytes.

This approach deals with a specialized message passing protocol rather than a commodity protocol. It also focuses on increasing bandwidth and processor availability, rather than message latency. Offloading message fragmentation and reassembly targets large message transmission, as do the RTS and CTS operations. Applications transmitting small messages do not need RTS/CTS dynamics or fragmentation handling and thus do not reap a performance gain.

6.3 Trapeze: Optimizations for TCP/IP over Gigabit Networks

Chase et al. propose a variety of optimizations for gigabit networks using commodity protocols. Their Trapeze messaging system seeks to maintain high network bandwidth, increase CPU utilization at both the sender and receiver, and decrease communications latency[5]. They implement TCP/IP checksum offload, adaptive message pipelining, zero-copy data movement, and configurable MTUs. Scatter/gather DMA allows payloads to occupy multiple, noncontiguous page frames. Pipelining allows overlapping DMA transfers on the I/O bus and network to reduce large message latency. Experiments using these techniques over Myrinet and gigabit Ethernet yielded large increases in TCP bandwidth and moderate decreases in CPU utilization. UDP one-way message latency for 64-byte packets decreased by roughly 40%.

IP packets are encapsulated in Trapeze messages. Protocol headers are located in the control message portion of a Trapeze packet, and a payload is attached. Thus the implementation does not use purely commodity protocols. The ability to configure large MTUs and adaptive message pipelining yields a performance gain only for large messages.

It is also only applicable to Myrinet networks, which have an unlimited frame size. Latency reduction is based only on zero-copy techniques, and the figures are for one-way transmission. Zero-copy decreases delay between the network interface card and the application, as it does in splintering. However, Trapeze ignores application-to-application round-trip message relay and acknowledgment.

6.4 Offloading IP Fragmentation and Reassembly

IP packet endpoint fragmentation and reassembly can be delegated to the NIC to reduce communications overhead. Gilfeather et al. demonstrate that this technique yields increased CPU utilization and reduced message latency[8]. Unlike IP error checking, this task is not too computationally intensive for a commodity programmable NIC. Splintering IP in this manner uses available resources optimally.

Messages larger than the transmission medium's MTU are allowed to pass through the network stack above the data link layer. The NIC fragments the messages to MTU size before transmission. When the NIC receives a fragmented message, it reassembles it locally, rather than delegating the task to the OS. The entire packet then travels up the network stack. Offloaded fragmentation and reassembly is transparent to the application. Because messages are fragmented at a lower level in the network stack, simple modifications must be made to allow messages larger than the MTU at the levels above. Unlike interrupt coalescing, this technique does not increase message latency. The functionality of the protocol is left unchanged; its tasks simply take place on different hardware. Offloading this part of the IP protocol decreases CPU communications processing by about 50% while increasing effective processor utilization nearly twofold for large messages.

This technique is an optimization for large messages. Small messages are not fragmented or reassembled at any layer of the network stack.

Thus offloading this operation does not decrease latency or optimize any other communication parameters for small message transfer.

Chapter 7. Conclusions

Splintering takes advantage of available resources at an optimal level. By allowing the operating system to retain control of resources, memory and connections are managed efficiently. Moreover, the operating system sees a reduction in communications overhead. Parts of the protocol stack are offloaded to the application and the network interface card, but neither is overwhelmed by the tasks it inherits.

Splintering TCP acknowledgments yields reduced round-trip communications delay between applications in a high-performance computing environment. Although end-to-end reliability is sacrificed, the process of sending TCP acknowledgments no longer resides in the latency path, so there is no per-packet delay cost for sending ACKs. Also, bypassing the operating system on message receipt decreases communications overhead. This increases processor availability and allows the CPU to devote more cycles to computation. These modifications adapt commodity TCP to a parallel computing environment. Using splintering and other optimizing techniques, TCP may become a viable protocol for high-performance computing.

References

[1] D. Buntinas, D. K. Panda, and R. Brightwell. Application-bypass broadcast in MPICH over GM. In Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003), to appear.

[2] J. Chase, A. Gallatin, and K. Yocum. End system optimizations for high-speed TCP. IEEE Communications Magazine, 39(4):68-74.

[3] D. Culler, R. Karp, D. Patterson, A. Sahay, K. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a realistic model of parallel computation. In Proceedings of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 1-12.

[4] R. Fatoohi and S. Weeratunga. Performance evaluation of three distributed computing environments for scientific applications. In Proceedings of Supercomputing '94, Washington, DC, November.

[5] A. Gallatin, J. Chase, and K. Yocum. Trapeze/IP: TCP/IP at near-gigabit speeds. In Proceedings of the USENIX '99 Technical Conference, June.

[6] Network & communications: Gigabit Ethernet. Web, April.

[7] P. Gilfeather and A. Maccabe. Making TCP viable as a high performance computing protocol. In Proceedings of the Los Alamos Computer Science Institute (LACSI) Symposium.

[8] P. Gilfeather, A. Maccabe, and T. Underwood. Increasing performance in commodity IP. In Proceedings of the Grace Hopper Celebration of Women in Computing.

[9] P. D. Haynes and M. Côté. Parallel fast Fourier transforms for electronic structure calculations. Computer Physics Communications, 129.

[10] E. León. An MPI tool to measure application sensitivity to variation in communication parameters. Technical Report TR-CS, University of New Mexico.

[11] A. Maccabe, W. Zhu, J. Otto, and R. Riesen. Experience in offloading protocol processing to a programmable NIC. In SC2002 High Performance Networking and Computing, Baltimore, MD, November 2002.

[12] R. P. Martin, A. Vahdat, D. E. Culler, and T. E. Anderson. Effects of communication latency, overhead, and bandwidth in a cluster architecture. In ISCA, pages 85-97.

[13] Myrinet product list. Web, April.

[14] P. Shivam, P. Wyckoff, and D. K. Panda. EMP: Zero-copy OS-bypass NIC-driven Gigabit Ethernet message passing. In Proceedings of SC2001.


More information

Introduction to Networks and the Internet

Introduction to Networks and the Internet Introduction to Networks and the Internet CMPE 80N Announcements Project 2. Reference page. Library presentation. Internet History video. Spring 2003 Week 7 1 2 Today Internetworking (cont d). Fragmentation.

More information

Motivation CPUs can not keep pace with network

Motivation CPUs can not keep pace with network Deferred Segmentation For Wire-Speed Transmission of Large TCP Frames over Standard GbE Networks Bilic Hrvoye (Billy) Igor Chirashnya Yitzhak Birk Zorik Machulsky Technion - Israel Institute of technology

More information

Computer Communication Networks Midterm Review

Computer Communication Networks Midterm Review Computer Communication Networks Midterm Review ICEN/ICSI 416 Fall 2018 Prof. Aveek Dutta 1 Instructions The exam is closed book, notes, computers, phones. You can use calculator, but not one from your

More information

Enabling Gigabit IP for Intelligent Systems

Enabling Gigabit IP for Intelligent Systems Enabling Gigabit IP for Intelligent Systems Nick Tsakiris Flinders University School of Informatics & Engineering GPO Box 2100, Adelaide, SA Australia Greg Knowles Flinders University School of Informatics

More information

AN MPI TOOL TO MEASURE APPLICATION SENSITIVITY TO VARIATION IN COMMUNICATION PARAMETERS EDGAR A. LEÓN BORJA

AN MPI TOOL TO MEASURE APPLICATION SENSITIVITY TO VARIATION IN COMMUNICATION PARAMETERS EDGAR A. LEÓN BORJA AN MPI TOOL TO MEASURE APPLICATION SENSITIVITY TO VARIATION IN COMMUNICATION PARAMETERS by EDGAR A. LEÓN BORJA B.S., Computer Science, Universidad Nacional Autónoma de México, 2001 THESIS Submitted in

More information

NETWORK OVERLAYS: AN INTRODUCTION

NETWORK OVERLAYS: AN INTRODUCTION NETWORK OVERLAYS: AN INTRODUCTION Network overlays dramatically increase the number of virtual subnets that can be created on a physical network, which in turn supports multitenancy and virtualization

More information

Chapter 13 TRANSPORT. Mobile Computing Winter 2005 / Overview. TCP Overview. TCP slow-start. Motivation Simple analysis Various TCP mechanisms

Chapter 13 TRANSPORT. Mobile Computing Winter 2005 / Overview. TCP Overview. TCP slow-start. Motivation Simple analysis Various TCP mechanisms Overview Chapter 13 TRANSPORT Motivation Simple analysis Various TCP mechanisms Distributed Computing Group Mobile Computing Winter 2005 / 2006 Distributed Computing Group MOBILE COMPUTING R. Wattenhofer

More information

Managing Caching Performance and Differentiated Services

Managing Caching Performance and Differentiated Services CHAPTER 10 Managing Caching Performance and Differentiated Services This chapter explains how to configure TCP stack parameters for increased performance ant throughput and how to configure Type of Service

More information

Mobile Communications Chapter 9: Mobile Transport Layer

Mobile Communications Chapter 9: Mobile Transport Layer Prof. Dr.-Ing Jochen H. Schiller Inst. of Computer Science Freie Universität Berlin Germany Mobile Communications Chapter 9: Mobile Transport Layer Motivation, TCP-mechanisms Classical approaches (Indirect

More information

Novel Intelligent I/O Architecture Eliminating the Bus Bottleneck

Novel Intelligent I/O Architecture Eliminating the Bus Bottleneck Novel Intelligent I/O Architecture Eliminating the Bus Bottleneck Volker Lindenstruth; lindenstruth@computer.org The continued increase in Internet throughput and the emergence of broadband access networks

More information

UNIT IV -- TRANSPORT LAYER

UNIT IV -- TRANSPORT LAYER UNIT IV -- TRANSPORT LAYER TABLE OF CONTENTS 4.1. Transport layer. 02 4.2. Reliable delivery service. 03 4.3. Congestion control. 05 4.4. Connection establishment.. 07 4.5. Flow control 09 4.6. Transmission

More information

Networking interview questions

Networking interview questions Networking interview questions What is LAN? LAN is a computer network that spans a relatively small area. Most LANs are confined to a single building or group of buildings. However, one LAN can be connected

More information

ECE 650 Systems Programming & Engineering. Spring 2018

ECE 650 Systems Programming & Engineering. Spring 2018 ECE 650 Systems Programming & Engineering Spring 2018 Networking Introduction Tyler Bletsch Duke University Slides are adapted from Brian Rogers (Duke) Computer Networking A background of important areas

More information

Data Link Layer. Our goals: understand principles behind data link layer services: instantiation and implementation of various link layer technologies

Data Link Layer. Our goals: understand principles behind data link layer services: instantiation and implementation of various link layer technologies Data Link Layer Our goals: understand principles behind data link layer services: link layer addressing instantiation and implementation of various link layer technologies 1 Outline Introduction and services

More information

Evaluation of a Zero-Copy Protocol Implementation

Evaluation of a Zero-Copy Protocol Implementation Evaluation of a Zero-Copy Protocol Implementation Karl-André Skevik, Thomas Plagemann, Vera Goebel Department of Informatics, University of Oslo P.O. Box 18, Blindern, N-316 OSLO, Norway karlas, plageman,

More information

CERN openlab Summer 2006: Networking Overview

CERN openlab Summer 2006: Networking Overview CERN openlab Summer 2006: Networking Overview Martin Swany, Ph.D. Assistant Professor, Computer and Information Sciences, U. Delaware, USA Visiting Helsinki Institute of Physics (HIP) at CERN swany@cis.udel.edu,

More information

EMP: Zero-copy OS-bypass NIC-driven Gigabit Ethernet Message Passing

EMP: Zero-copy OS-bypass NIC-driven Gigabit Ethernet Message Passing EMP: Zero-copy OS-bypass NIC-driven Gigabit Ethernet Message Passing Piyush Shivam Computer/Information Science The Ohio State University 2015 Neil Avenue Columbus, OH 43210 shivam@cis.ohio-state.edu Pete

More information

Advanced Computer Networks. End Host Optimization

Advanced Computer Networks. End Host Optimization Oriana Riva, Department of Computer Science ETH Zürich 263 3501 00 End Host Optimization Patrick Stuedi Spring Semester 2017 1 Today End-host optimizations: NUMA-aware networking Kernel-bypass Remote Direct

More information

Module 16: Distributed System Structures

Module 16: Distributed System Structures Chapter 16: Distributed System Structures Module 16: Distributed System Structures Motivation Types of Network-Based Operating Systems Network Structure Network Topology Communication Structure Communication

More information

Outline 9.2. TCP for 2.5G/3G wireless

Outline 9.2. TCP for 2.5G/3G wireless Transport layer 9.1 Outline Motivation, TCP-mechanisms Classical approaches (Indirect TCP, Snooping TCP, Mobile TCP) PEPs in general Additional optimizations (Fast retransmit/recovery, Transmission freezing,

More information

CSE 4215/5431: Mobile Communications Winter Suprakash Datta

CSE 4215/5431: Mobile Communications Winter Suprakash Datta CSE 4215/5431: Mobile Communications Winter 2013 Suprakash Datta datta@cse.yorku.ca Office: CSEB 3043 Phone: 416-736-2100 ext 77875 Course page: http://www.cse.yorku.ca/course/4215 Some slides are adapted

More information

MIDTERM EXAMINATION #2 OPERATING SYSTEM CONCEPTS U N I V E R S I T Y O F W I N D S O R S C H O O L O F C O M P U T E R S C I E N C E

MIDTERM EXAMINATION #2 OPERATING SYSTEM CONCEPTS U N I V E R S I T Y O F W I N D S O R S C H O O L O F C O M P U T E R S C I E N C E MIDTERM EXAMINATION #2 OPERATING SYSTEM CONCEPTS 03-60-367-01 U N I V E R S I T Y O F W I N D S O R S C H O O L O F C O M P U T E R S C I E N C E Intersession 2008 Last Name: First Name: Student ID: PLEASE

More information

Announcements. No book chapter for this topic! Slides are posted online as usual Homework: Will be posted online Due 12/6

Announcements. No book chapter for this topic! Slides are posted online as usual Homework: Will be posted online Due 12/6 Announcements No book chapter for this topic! Slides are posted online as usual Homework: Will be posted online Due 12/6 Copyright c 2002 2017 UMaine Computer Science Department 1 / 33 1 COS 140: Foundations

More information

Introduction to Open System Interconnection Reference Model

Introduction to Open System Interconnection Reference Model Chapter 5 Introduction to OSI Reference Model 1 Chapter 5 Introduction to Open System Interconnection Reference Model Introduction The Open Systems Interconnection (OSI) model is a reference tool for understanding

More information

by Brian Hausauer, Chief Architect, NetEffect, Inc

by Brian Hausauer, Chief Architect, NetEffect, Inc iwarp Ethernet: Eliminating Overhead In Data Center Designs Latest extensions to Ethernet virtually eliminate the overhead associated with transport processing, intermediate buffer copies, and application

More information

Lessons learned from MPI

Lessons learned from MPI Lessons learned from MPI Patrick Geoffray Opinionated Senior Software Architect patrick@myri.com 1 GM design Written by hardware people, pre-date MPI. 2-sided and 1-sided operations: All asynchronous.

More information

Continuous Real Time Data Transfer with UDP/IP

Continuous Real Time Data Transfer with UDP/IP Continuous Real Time Data Transfer with UDP/IP 1 Emil Farkas and 2 Iuliu Szekely 1 Wiener Strasse 27 Leopoldsdorf I. M., A-2285, Austria, farkas_emil@yahoo.com 2 Transilvania University of Brasov, Eroilor

More information

CS4700/CS5700 Fundamentals of Computer Networks

CS4700/CS5700 Fundamentals of Computer Networks CS4700/CS5700 Fundamentals of Computer Networks Lecture 14: TCP Slides used with permissions from Edward W. Knightly, T. S. Eugene Ng, Ion Stoica, Hui Zhang Alan Mislove amislove at ccs.neu.edu Northeastern

More information

No book chapter for this topic! Slides are posted online as usual Homework: Will be posted online Due 12/6

No book chapter for this topic! Slides are posted online as usual Homework: Will be posted online Due 12/6 Announcements No book chapter for this topic! Slides are posted online as usual Homework: Will be posted online Due 12/6 Copyright c 2002 2017 UMaine School of Computing and Information S 1 / 33 COS 140:

More information

Advanced Computer Networking. CYBR 230 Jeff Shafer University of the Pacific QUIC

Advanced Computer Networking. CYBR 230 Jeff Shafer University of the Pacific QUIC CYBR 230 Jeff Shafer University of the Pacific QUIC 2 It s a Google thing. (Originally) 3 Google Engineering Motivations Goal: Decrease end-user latency on web To increase user engagement So they see more

More information

Networking for Data Acquisition Systems. Fabrice Le Goff - 14/02/ ISOTDAQ

Networking for Data Acquisition Systems. Fabrice Le Goff - 14/02/ ISOTDAQ Networking for Data Acquisition Systems Fabrice Le Goff - 14/02/2018 - ISOTDAQ Outline Generalities The OSI Model Ethernet and Local Area Networks IP and Routing TCP, UDP and Transport Efficiency Networking

More information

Introduction to Ethernet Latency

Introduction to Ethernet Latency Introduction to Ethernet Latency An Explanation of Latency and Latency Measurement The primary difference in the various methods of latency measurement is the point in the software stack at which the latency

More information

NT1210 Introduction to Networking. Unit 10

NT1210 Introduction to Networking. Unit 10 NT1210 Introduction to Networking Unit 10 Chapter 10, TCP/IP Transport Objectives Identify the major needs and stakeholders for computer networks and network applications. Compare and contrast the OSI

More information

10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G

10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G 10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G Mohammad J. Rashti and Ahmad Afsahi Queen s University Kingston, ON, Canada 2007 Workshop on Communication Architectures

More information

Mobile Transport Layer

Mobile Transport Layer Mobile Transport Layer 1 Transport Layer HTTP (used by web services) typically uses TCP Reliable transport between TCP client and server required - Stream oriented, not transaction oriented - Network friendly:

More information

Multiple unconnected networks

Multiple unconnected networks TCP/IP Life in the Early 1970s Multiple unconnected networks ARPAnet Data-over-cable Packet satellite (Aloha) Packet radio ARPAnet satellite net Differences Across Packet-Switched Networks Addressing Maximum

More information

Performance Evaluation of Myrinet-based Network Router

Performance Evaluation of Myrinet-based Network Router Performance Evaluation of Myrinet-based Network Router Information and Communications University 2001. 1. 16 Chansu Yu, Younghee Lee, Ben Lee Contents Suez : Cluster-based Router Suez Implementation Implementation

More information

Reliable Transport I: Concepts and TCP Protocol

Reliable Transport I: Concepts and TCP Protocol Reliable Transport I: Concepts and TCP Protocol Brad Karp UCL Computer Science CS 3035/GZ01 29 th October 2013 Part I: Transport Concepts Layering context Transport goals Transport mechanisms 2 Context:

More information

Network Adapter. Increased demand for bandwidth and application processing in. Improve B2B Application Performance with Gigabit Server

Network Adapter. Increased demand for bandwidth and application processing in. Improve B2B Application Performance with Gigabit Server Improve B2B Application Performance with Gigabit Server Network Adapter By Uri Elzur Business-to-business (B2B) applications and gigabit networking speeds increase the load on server CPUs. These challenges

More information

Lecture 3. The Network Layer (cont d) Network Layer 1-1

Lecture 3. The Network Layer (cont d) Network Layer 1-1 Lecture 3 The Network Layer (cont d) Network Layer 1-1 Agenda The Network Layer (cont d) What is inside a router? Internet Protocol (IP) IPv4 fragmentation and addressing IP Address Classes and Subnets

More information

Computer Networks Principles

Computer Networks Principles Computer Networks Principles Introduction Prof. Andrzej Duda duda@imag.fr http://duda.imag.fr 1 Contents Introduction protocols and layered architecture encapsulation interconnection structures performance

More information

First Exam for ECE671 Spring /22/18

First Exam for ECE671 Spring /22/18 ECE67: First Exam First Exam for ECE67 Spring 208 02/22/8 Instructions: Put your name and student number on each sheet of paper! The exam is closed book. You have 75 minutes to complete the exam. Be a

More information

Group Management Schemes for Implementing MPI Collective Communication over IP Multicast

Group Management Schemes for Implementing MPI Collective Communication over IP Multicast Group Management Schemes for Implementing MPI Collective Communication over IP Multicast Xin Yuan Scott Daniels Ahmad Faraj Amit Karwande Department of Computer Science, Florida State University, Tallahassee,

More information

QuickSpecs. HP Z 10GbE Dual Port Module. Models

QuickSpecs. HP Z 10GbE Dual Port Module. Models Overview Models Part Number: 1Ql49AA Introduction The is a 10GBASE-T adapter utilizing the Intel X722 MAC and X557-AT2 PHY pairing to deliver full line-rate performance, utilizing CAT 6A UTP cabling (or

More information

Performance Evaluation of InfiniBand with PCI Express

Performance Evaluation of InfiniBand with PCI Express Performance Evaluation of InfiniBand with PCI Express Jiuxing Liu Amith Mamidala Abhinav Vishnu Dhabaleswar K Panda Department of Computer and Science and Engineering The Ohio State University Columbus,

More information

High Performance Computing: Concepts, Methods & Means Enabling Technologies 2 : Cluster Networks

High Performance Computing: Concepts, Methods & Means Enabling Technologies 2 : Cluster Networks High Performance Computing: Concepts, Methods & Means Enabling Technologies 2 : Cluster Networks Prof. Amy Apon Department of Computer Science and Computer Engineering University of Arkansas March 15 th,

More information

AN exam March

AN exam March AN exam March 29 2018 Dear student This exam consists of 7 questions. The total number of points is 100. Read the questions carefully. Be precise and concise. Write in a readable way. Q1. UDP and TCP (25

More information

Introduction to Protocols

Introduction to Protocols Chapter 6 Introduction to Protocols 1 Chapter 6 Introduction to Protocols What is a Network Protocol? A protocol is a set of rules that governs the communications between computers on a network. These

More information

AN ABSTRACT OF THE THESIS OF

AN ABSTRACT OF THE THESIS OF AN ABSTRACT OF THE THESIS OF Richard Edgecombe for the degree of Master of Science in Computer Science presented on March 17, 2008. Title: An Implementation of a Reliable Broadcast Scheme for 802.11 using

More information

CS 3640: Introduction to Networks and Their Applications

CS 3640: Introduction to Networks and Their Applications CS 3640: Introduction to Networks and Their Applications Fall 2018, Lecture 5: The Link Layer I Errors and medium access Instructor: Rishab Nithyanand Teaching Assistant: Md. Kowsar Hossain 1 You should

More information

Module 15: Network Structures

Module 15: Network Structures Module 15: Network Structures Background Topology Network Types Communication Communication Protocol Robustness Design Strategies 15.1 A Distributed System 15.2 Motivation Resource sharing sharing and

More information

CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters

CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters Amit Karwande, Xin Yuan Dept. of Computer Science Florida State University Tallahassee, FL 32306 {karwande,xyuan}@cs.fsu.edu

More information

Virtualization, Xen and Denali

Virtualization, Xen and Denali Virtualization, Xen and Denali Susmit Shannigrahi November 9, 2011 Susmit Shannigrahi () Virtualization, Xen and Denali November 9, 2011 1 / 70 Introduction Virtualization is the technology to allow two

More information

CH : 15 LOCAL AREA NETWORK OVERVIEW

CH : 15 LOCAL AREA NETWORK OVERVIEW CH : 15 LOCAL AREA NETWORK OVERVIEW P. 447 LAN (Local Area Network) A LAN consists of a shared transmission medium and a set of hardware and software for interfacing devices to the medium and regulating

More information

Lixia Zhang M. I. T. Laboratory for Computer Science December 1985

Lixia Zhang M. I. T. Laboratory for Computer Science December 1985 Network Working Group Request for Comments: 969 David D. Clark Mark L. Lambert Lixia Zhang M. I. T. Laboratory for Computer Science December 1985 1. STATUS OF THIS MEMO This RFC suggests a proposed protocol

More information

IsoStack Highly Efficient Network Processing on Dedicated Cores

IsoStack Highly Efficient Network Processing on Dedicated Cores IsoStack Highly Efficient Network Processing on Dedicated Cores Leah Shalev Eran Borovik, Julian Satran, Muli Ben-Yehuda Outline Motivation IsoStack architecture Prototype TCP/IP over 10GE on a single

More information

RDMA-like VirtIO Network Device for Palacios Virtual Machines

RDMA-like VirtIO Network Device for Palacios Virtual Machines RDMA-like VirtIO Network Device for Palacios Virtual Machines Kevin Pedretti UNM ID: 101511969 CS-591 Special Topics in Virtualization May 10, 2012 Abstract This project developed an RDMA-like VirtIO network

More information

SCTP s Reliability and Fault Tolerance

SCTP s Reliability and Fault Tolerance SCTP s Reliability and Fault Tolerance Brad Penoff, Mike Tsai, and Alan Wagner Department of Computer Science University of British Columbia Vancouver, Canada Distributed Systems Group Seattle Conference

More information

Infiniband Fast Interconnect

Infiniband Fast Interconnect Infiniband Fast Interconnect Yuan Liu Institute of Information and Mathematical Sciences Massey University May 2009 Abstract Infiniband is the new generation fast interconnect provides bandwidths both

More information

Midterm #2 Exam Solutions April 26, 2006 CS162 Operating Systems

Midterm #2 Exam Solutions April 26, 2006 CS162 Operating Systems University of California, Berkeley College of Engineering Computer Science Division EECS Spring 2006 Anthony D. Joseph Midterm #2 Exam April 26, 2006 CS162 Operating Systems Your Name: SID AND 162 Login:

More information

Lesson 2-3: The IEEE x MAC Layer

Lesson 2-3: The IEEE x MAC Layer Module 2: Establishing Wireless Connectivity Lesson 2-3: The IEEE 802.11x MAC Layer Lesson Overview This lesson describes basic IEEE 802.11x MAC operation, beginning with an explanation of contention schemes

More information

Network Design Considerations for Grid Computing

Network Design Considerations for Grid Computing Network Design Considerations for Grid Computing Engineering Systems How Bandwidth, Latency, and Packet Size Impact Grid Job Performance by Erik Burrows, Engineering Systems Analyst, Principal, Broadcom

More information

Mobile Communications Chapter 9: Mobile Transport Layer

Mobile Communications Chapter 9: Mobile Transport Layer Prof. Dr.-Ing Jochen H. Schiller Inst. of Computer Science Freie Universität Berlin Germany Mobile Communications Chapter 9: Mobile Transport Layer Motivation, TCP-mechanisms Classical approaches (Indirect

More information

A Low Latency Solution Stack for High Frequency Trading. High-Frequency Trading. Solution. White Paper

A Low Latency Solution Stack for High Frequency Trading. High-Frequency Trading. Solution. White Paper A Low Latency Solution Stack for High Frequency Trading White Paper High-Frequency Trading High-frequency trading has gained a strong foothold in financial markets, driven by several factors including

More information

II. Principles of Computer Communications Network and Transport Layer

II. Principles of Computer Communications Network and Transport Layer II. Principles of Computer Communications Network and Transport Layer A. Internet Protocol (IP) IPv4 Header An IP datagram consists of a header part and a text part. The header has a 20-byte fixed part

More information

Network Management & Monitoring

Network Management & Monitoring Network Management & Monitoring Network Delay These materials are licensed under the Creative Commons Attribution-Noncommercial 3.0 Unported license (http://creativecommons.org/licenses/by-nc/3.0/) End-to-end

More information

UNIT 2 TRANSPORT LAYER

UNIT 2 TRANSPORT LAYER Network, Transport and Application UNIT 2 TRANSPORT LAYER Structure Page No. 2.0 Introduction 34 2.1 Objective 34 2.2 Addressing 35 2.3 Reliable delivery 35 2.4 Flow control 38 2.5 Connection Management

More information

Sena Technologies White Paper: Latency/Throughput Test. Device Servers/Bluetooth-Serial Adapters

Sena Technologies White Paper: Latency/Throughput Test. Device Servers/Bluetooth-Serial Adapters Sena Technologies White Paper: Latency/Throughput Test of October 30, 2007 Copyright Sena Technologies, Inc 2007 All rights strictly reserved. No part of this document may not be reproduced or distributed

More information

Programming Assignment 3: Transmission Control Protocol

Programming Assignment 3: Transmission Control Protocol CS 640 Introduction to Computer Networks Spring 2005 http://www.cs.wisc.edu/ suman/courses/640/s05 Programming Assignment 3: Transmission Control Protocol Assigned: March 28,2005 Due: April 15, 2005, 11:59pm

More information