SPLINTERING TCP TO DECREASE SMALL MESSAGE LATENCY IN HIGH-PERFORMANCE COMPUTING

by

BREANNE DUNCAN

THESIS

Submitted in Partial Fulfillment of the Requirements for the Degree of Bachelor of Science, Computer Science

The University of New Mexico
Albuquerque, New Mexico

May 2003

© 2003, Breanne Duncan

Dedication

For my mother, whom I helped copy BASIC code out of magazines onto our Apple IIc when I was in kindergarten. Without her inspiration, my love for computing may have never been born.

Acknowledgments

Foremost, I would like to thank my advisor, Prof. Barney Maccabe, for introducing me to the field of computer systems and networks and for giving me the opportunity to do undergraduate research. I owe many thanks to Patricia Crowley, my mentor, for dedicating much time to my success on this project and guiding me through my first research experience. Wenbin Zhu also was of much help in desperate times, as was Edgar León.

SPLINTERING TCP TO DECREASE SMALL MESSAGE LATENCY IN HIGH-PERFORMANCE COMPUTING

by

BREANNE DUNCAN

ABSTRACT OF THESIS

Submitted in Partial Fulfillment of the Requirements for the Degree of Bachelor of Science, Computer Science

The University of New Mexico
Albuquerque, New Mexico

May 2003

SPLINTERING TCP TO DECREASE SMALL MESSAGE LATENCY IN HIGH-PERFORMANCE COMPUTING

by

BREANNE DUNCAN

B.S., Computer Science, University of New Mexico, 2003

Abstract

The adoption of commodity networking protocols and hardware in high-performance computing yields low-cost alternatives to expensive proprietary communications infrastructures. However, commodity protocols do not inherently have the advantages of low overhead, low latency, and offloaded control programs associated with specialty components, such as Myrinet. Many high-performance computing applications depend on the rapid transmission of small messages between processors. Hosts in a high-performance computing environment using gigabit Ethernet and TCP/IP need mechanisms to acknowledge small messages quickly without engaging the operating system. By splintering TCP and transferring acknowledgment capabilities to the network interface card (NIC), an application avoids costly interrupts and context switches into the operating system that incur delay. Acknowledgments are sent earlier, allowing the sender to advance its TCP flow control window more frequently. Results from a proof-of-concept model suggest that small message acknowledgment latency can be reduced by approximately 40% through splintering the TCP stack.

Contents

1 Introduction
2 Motivations
  2.1 Host Bottlenecks
    2.1.1 Transmission Latency
    2.1.2 TCP Performance Issues
3 Modifications for High-Performance Computing
  3.1 Protocol Offloading
  3.2 OS Bypass
    3.2.1 Zero-Copy Data Movement
    3.2.2 Advantages
    3.2.3 Disadvantages
    3.2.4 Partial OS Bypass
  3.3 Splintering
4 Splintering TCP Acknowledgments to Decrease Latency
  4.1 Error checking
  4.2 Assumptions
  4.3 Implementation
  4.4 Performance Testing
5 Results
6 Related Work
  6.1 MPICH Application-Bypass
  6.2 Offloading RMPP RTS/CTS
  6.3 Trapeze: Optimizations for TCP/IP over Gigabit Networks
  6.4 Offloading IP Fragmentation and Reassembly
7 Conclusions
References

Chapter 1

Introduction

High-performance computing is typically characterized by specially designed communications hardware and protocols. These features support maximum utilization of node processing power and provide high communication bandwidth between nodes. The low-latency transmission that many scientific applications depend on is inherent in such networks. While this approach may provide the best technology available at any given time, the high expense of implementing such a system is a deterrent in many circumstances. Typically only large institutions have access to specialized, proprietary hardware and protocols. A rapidly advancing alternative involves the use of long-established networking protocols running on commodity hardware. The advantages of a commodity approach are numerous. The cost of hardware is an obvious comparison. Commodity network cards, switches, and cabling are vastly less expensive than their proprietary counterparts. As of March 2003, gigabit Ethernet cards could be purchased for as low as $30[6], while Myrinet cards were around $1000[13]. Those who previously could not afford high-performance computing may soon be able to, due to the low cost of commodity components. Cost is not the only advantage of using commodity components.

Commodity products are readily available in abundance and many people are skilled in their use. Modification and maintenance of network stack and programmable network interface card (NIC) code is simplified through this ubiquitous knowledge. This speeds research and development times in the context of specialization for high-performance clusters. Commodity hardware and protocols are highly interoperable. This interoperability extends horizontally between host nodes in a network and vertically between network architecture layers. Commodity components are obviously not tailored to a high-performance computing environment. When commodity components are used without modification, one cannot expect satisfactory performance. It is necessary to change the functionality of networking hardware, and one must tailor commodity protocols to respond to a high-performance environment. In theory, this approach yields the advantages of commodity components and the power of specialty components. Many high-performance computing applications rely on rapid communication of small messages between compute nodes. Because large quantities of such messages are sent, it is important that message latency be as low as possible. Decreased latency yields an increase in computational efficiency: the processor spends less time waiting to receive data before continuing computation. However, low latency is not inherent in commodity protocols and networks. Modifications must be made to protocol stacks and to the interaction between the NIC and the application to ensure low latency while maintaining high bandwidth. The network interface card must take on responsibilities normally delegated to the host operating system (OS). This reduces operating system overhead and decreases message latency by maintaining a more direct link between the network and the process. High-performance applications and libraries usually differentiate between small and large messages and handle them very differently. For example, Sun MPI treats packets under 1 KB as short messages, while larger packets are treated as long messages. Access permissions and memory locations for stored payloads are different for each message type. In this work, a small message refers to a packet of less than 1500 bytes, the maximum payload of an Ethernet frame.

This work discusses latency reduction for small message traffic. Chapter 2 is devoted to the problems of host bottlenecks and transmission latency. Chapter 3 discusses modifications to tailor commodity protocols to high-performance computing. Chapter 4 looks at splintering TCP acknowledgments, a method proposed to decrease transmission latency in a high-performance infrastructure built on commodity components. Chapter 5 explores results from testing this model. Chapter 6 discusses related work. The final chapter presents conclusions about the efficacy of splintering TCP acknowledgments.

Chapter 2

Motivations

Commodity components and protocols do not receive and process packets efficiently in a high-performance environment. The overhead incurred through processing a received packet in the operating system and copying data to the application increases latency dramatically. TCP and gigabit Ethernet packet handling must evolve to accommodate high-performance computing needs. The path between the network interface card and the application must minimize latency without putting undue strain on the NIC or host processor.

2.1 Host Bottlenecks

High-speed networks, such as 10 Gbps Ethernet, now allow rapid point-to-point communication. However, we are overlooking bottlenecks internal to the host that seriously diminish communications efficiency. Culler et al. identify communication delay, overhead, and gap as key parameters in defining parallel computing performance[3]. High-performance applications are sensitive to these metrics[12]. Low latency and overhead are necessary for efficient computation and communication[14]. An increase in any parameter may decrease overall performance.
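For reference, the LogP model cited above characterizes a small-message transfer by the latency L, the per-message processing overhead o paid at each host, and the gap g between consecutive message injections. Under a standard reading of that model (a rough guide rather than a result of this work), the time from the start of a send until the message is available at the receiver is approximately

    o_send + L + o_recv

and splintering, being a receive-side optimization, targets the receive-side overhead term; the wire portion of L is fixed by the hardware.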

Culler et al. define latency as a metric pertaining to NIC-to-NIC transmissions. Because my work deals with application-to-application message transfer, I define latency as NIC-to-NIC transmission time plus time elapsed during communications processing at a host system. This gives a more realistic assessment of the time it takes data to travel between processes in a high-performance computing infrastructure. Although bandwidth and processor speeds continually increase, overhead and high application-to-application message latency remain a problem. A high-speed line sends small messages so quickly that it creates a massive amount of interrupt pressure within the kernel space of the receiving host[11, 8]. Furthermore, the kernel must deal with processing packets and moving message payloads into user space after receipt and error checking. High communications overhead can overwhelm the processor and prevent it from spending valuable time on computation. This processing time also prevents messages from being delivered to the application quickly. Because of these bottlenecks, applications cannot harness the bandwidth and speed the network provides[7, 3].

2.1.1 Transmission Latency

Many high-performance, scientific computing applications depend on rapid, low-latency transmission of messages between processors. High message latency leads to CPU idling and wasted resources. The application may wait for message arrival before continuing computation[3]. Ensuring consistently low message latency is key for computations linearly sensitive to communications delay, such as fast Fourier transforms, and for jitter-sensitive, real-time applications. Processes frequently sending small messages between nodes suffer performance loss in a high-latency environment. Because much of their focus is on communication rather than strictly computation, inefficiencies in the network infrastructure or host message processing system negatively affect computational efficiency as well.

Time spent waiting to receive messages is time wasted in terms of computation. Serially processing large numbers of small messages in turn can overwhelm the CPU and monopolize computing resources. A high-latency network hinders scalability as well. Adding more nodes to a network necessitates more communication and thus presents an additive increase in latency. Many scientific applications rely on small message transmission to gather values computed on other nodes for local computation. The fast Fourier transform is frequently used in scientific, high-performance computing. This operation is very sensitive to latency due to the small data messages passed between processes[9]. SMG2000, LU, and EM3D are examples of prominent latency-sensitive scientific benchmark applications. SMG2000, a supercomputing benchmark, is a parallel semicoarsening multigrid solver for the linear systems of a particular diffusion equation. Nodes running SMG2000 often need current approximate solutions from other processes before performing computations locally[10]. This makes the computationally intensive program a communication-intensive program as well. LU is a NASA Advanced Supercomputing Parallel Benchmark (NPB 2) program that solves a finite difference discretization of the 3-D compressible Navier-Stokes equations. At each iteration of the algorithm, many small messages are sent between processes[4]. EM3D is the kernel of an application that models propagation of electromagnetic waves through three-dimensional objects. It is linearly sensitive to latency and the number of messages sent per processor is extremely high (124.76 msg/proc/ms)[12].

2.1.2 TCP Performance Issues

Reliability and connection costs make TCP a poor choice for high-performance computing. Flow control, congestion control, and error checking increase message latency. Connection maintenance presents scalability issues. For efficient communication, there must exist a connection between each pair of nodes. In an n-node system, this incurs an O(n²) cost for connection and disconnection.

A three-way handshake between each pair of nodes adds significant time and complexity to system startup. Closing connections presents a similar problem.
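To make the connection-scaling cost concrete (an illustrative calculation, not a measurement from this work): full pairwise connectivity among n nodes requires

    n(n-1)/2

TCP connections. A 64-node cluster therefore needs 2,016 connections, each with its own three-way handshake at startup and teardown exchange at shutdown; at 256 nodes the count grows to 32,640.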

Chapter 3

Modifications for High-Performance Computing

Many modifications of TCP/IP that increase protocol viability in high-performance computing have been suggested. The majority use protocol offloading to distribute network stack functionality over available hardware. Interrupt coalescing, offloaded packet fragmentation and reassembly, zero-copy data movement, and offloaded error checking have been used to increase commodity protocol performance. These techniques help increase processor effective utilization and communication bandwidth. They can increase application performance by reducing message latency and communication overhead for host processors. These optimizations alleviate problems that previously restricted commodity protocols and hardware from achieving peak performance.

3.1 Protocol Offloading

Offloading parts of the networking protocol stack naturally transfers some of the workload from the CPU to the NIC processor. The TCP and IP protocols, in whole or in part, can be moved to the network interface card to take strain off the host processor.

A total offload strategy involves moving all protocol functionality onto the NIC. Alternatively, a protocol can be partially offloaded. In this method, it is broken up into separate pieces and spread across available resources. Research efforts have involved offloading parts of the IP protocol, such as error checking and packet fragmentation and reassembly.

3.2 OS Bypass

Offloading can be used as a tool to allow communications tasks to bypass the operating system. Offloaded control allows the NIC to perform tasks normally delegated to the OS. After an application initially acquires a resource through the OS, the application is free to communicate with the resource directly without operating system intervention. In terms of latency and overhead, this strategy presents a one-time cost, rather than a per-packet cost.

3.2.1 Zero-Copy Data Movement

Using OS bypass, the NIC and application have a direct communication link. A zero-copy strategy can be used to move data directly from the NIC to user space. An application expecting to receive data can send buffer descriptors to the NIC, informing it where to place data. The NIC places packet payloads in the appropriate location in user space. For fragmented packets, this requires the network interface card to perform message reassembly. Traditional, non-zero-copy mechanisms entail the network interface card sending packets to the OS, rather than directly to the application. The operating system processes a packet before sending the payload to the application. This involves performing any necessary error checking, packet reassembly, or adjustment of reliability parameters, such as advancing the sender's TCP flow control window.
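The buffer-descriptor exchange described in this section might look roughly like the following C sketch. The descriptor layout, the ring array standing in for NIC-visible memory, and the post_receive_buffer() helper are all hypothetical; a real NIC such as the Tigon II defines its own descriptor format and doorbell mechanism.

    /* Sketch of an application posting a receive buffer to the NIC.
     * All names and layouts here are illustrative assumptions. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define RING_ENTRIES 64

    struct recv_descriptor {          /* one posted receive buffer */
        uint64_t user_addr;           /* where the NIC should DMA the payload */
        uint32_t length;              /* size of the buffer in bytes */
        uint32_t tag;                 /* lets the app match completions to posts */
    };

    static struct recv_descriptor ring[RING_ENTRIES];  /* stands in for NIC-visible memory */
    static uint32_t producer;                           /* stands in for a doorbell register */

    /* Pin the buffer so its pages stay resident, then hand its address to the
     * (simulated) NIC by writing a descriptor and bumping the producer index. */
    static int post_receive_buffer(void *buf, uint32_t len, uint32_t tag)
    {
        if (mlock(buf, len) != 0) {            /* keep the page resident for DMA */
            perror("mlock");
            return -1;
        }
        struct recv_descriptor *d = &ring[producer % RING_ENTRIES];
        d->user_addr = (uint64_t)(uintptr_t)buf;
        d->length    = len;
        d->tag       = tag;
        producer++;                            /* a real driver would write a doorbell here */
        return 0;
    }

    int main(void)
    {
        char *payload_area = malloc(1500);     /* one MTU-sized receive buffer */
        if (!payload_area)
            return 1;
        if (post_receive_buffer(payload_area, 1500, 42) == 0)
            printf("posted buffer %p, %u descriptors outstanding\n",
                   (void *)payload_area, producer);
        munlock(payload_area, 1500);
        free(payload_area);
        return 0;
    }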

3.2.2 Advantages

OS bypass lowers message latency and CPU overhead. The zero-copy infrastructure prevents CPU and memory bus involvement in transferring received data into user space. The CPU is free to devote itself strictly to computation, rather than message processing and payload movement. In the context of small messages, OS bypass helps messages quickly move from the NIC to the application without going through a middleman. Because the amount of data in each message is small, it is latency and message processing, not data processing, that presents a problem. Each small message that arrives must interrupt the operating system, incurring a per-packet cost in terms of latency and overhead. This is not as problematic with large packets that arrive less frequently[5]. Although network bandwidth may be at an acceptable rate, NIC-to-application packet transmission may incur unacceptable costs going through the OS. Network performance is limited not by the hardware, but instead by host bottlenecks[2].

3.2.3 Disadvantages

In a total OS bypass model, the NIC controls communications entirely after resource acquisition. All protocol processing is handled by the NIC. Ethernet Message Passing (EMP)[14] is one implementation of this strategy. EMP delegates virtual memory management to the application, while descriptors are handled by both the application and the NIC. The operating system does not intervene in communications and zero-copy is used for data transfer. Moving too much functionality to the NIC does not optimize resource use or performance. NIC processing power is dwarfed by that of a host system. For example, current Alteon ACENIC gigabit Ethernet cards have only two 88 MHz processors and 2 MB of local RAM. Intense involvement in communications processing can overwhelm NIC resources.

While EMP shows an improvement over a traditional setup in terms of small message latency and bandwidth[14], other methods that use partial protocol offloading have been more successful[5].

3.2.4 Partial OS Bypass

Partial OS bypass involves offloading some, but not all, tasks to the NIC to take advantage of NIC and CPU resources at an optimal level. Under partial OS bypass, the operating system is still involved in communications tasks. However, it is interrupted much less frequently and spends fewer computation cycles dealing with communication. It also does not force applications to use a particular protocol, as does total OS bypass when offloading the entire protocol stack[8].

3.3 Splintering

Splintering combines partial offloading and partial OS bypass methods to optimize communication between the network and the application. The kernel protocol stack is splintered, or broken apart, and tasks are offloaded to both the NIC and the application. The operating system maintains control of resources, dealing with connection management, TCP reliability metrics, memory management, and scheduling. Pushing these concerns onto the network interface card would force the NIC to understand global resource management and resource protection. While it is possible to implement this, it would complicate NIC operation immensely and put strain on its limited processing power. Under splintering, the application notifies the network interface card of expected packets. It sends buffer descriptors to the NIC, giving the location at which to place packet payloads. The NIC moves messages directly to the application without intermediate buffer copies. This zero-copy strategy reduces delay for message transfer from the NIC to the application. After data transfer, the NIC may send message headers to the OS, or coalesce them to send at a later time.

Sending headers to the operating system ensures that the OS has current communications information. Figure 3.1 shows the splintered TCP architecture.

[Figure 3.1: The splintered TCP architecture. The application posts a receive descriptor to the NIC; the NIC places packet data directly in the application's receive buffer, and the packet header passes to the operating system for the socket read.]

I propose to splinter the TCP stack by offloading TCP acknowledgment (ACK) capabilities to the NIC. Sending ACKs at the NIC level, instead of in the kernel, decreases operating system involvement in message receipt. Application-to-application communication latency decreases and processor effective utilization increases.
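To make the proposed division of labor concrete, the sketch below outlines the per-packet decision the NIC firmware would make under splintering. Every name here (nic_receive(), dma_to_user_buffer(), send_ack_from_nic(), FAST_PATH_PORT, and so on) is an illustrative stand-in rather than Tigon firmware code, and the stub functions only print what the real operations would do.

    /* Sketch of the splintered receive path; all names are assumptions. */
    #include <stdint.h>
    #include <stdio.h>

    struct pkt {                       /* minimal view of a received frame */
        uint16_t dst_port;             /* identifies the connection */
        uint8_t  tcp_flags;            /* SYN/FIN/RST versus plain data */
        uint32_t seq;
        uint16_t payload_len;
        uint8_t  payload[1500];
    };

    #define FLAG_FIN 0x01
    #define FLAG_SYN 0x02
    #define FLAG_RST 0x04
    #define FAST_PATH_PORT 5001        /* the one connection being optimized (assumed) */

    static void deliver_to_os(const struct pkt *p)      { (void)p; printf("slow path: OS handles packet\n"); }
    static void dma_to_user_buffer(const struct pkt *p) { printf("DMA %u payload bytes to posted buffer\n", p->payload_len); }
    static void send_ack_from_nic(uint32_t seq, uint16_t len) { printf("NIC sends ACK for %u\n", seq + len); }
    static void forward_header_to_os(const struct pkt *p)     { (void)p; printf("header (no payload) queued for OS\n"); }

    /* Per-packet decision made in NIC firmware. */
    static void nic_receive(const struct pkt *p)
    {
        int is_control = p->tcp_flags & (FLAG_SYN | FLAG_FIN | FLAG_RST);
        if (is_control || p->dst_port != FAST_PATH_PORT) {
            deliver_to_os(p);                       /* connection management stays in the kernel */
            return;
        }
        dma_to_user_buffer(p);                      /* zero-copy move into the posted app buffer */
        send_ack_from_nic(p->seq, p->payload_len);  /* ACK before the OS is ever involved */
        forward_header_to_os(p);                    /* keep kernel TCP state current (may be coalesced) */
    }

    int main(void)
    {
        struct pkt p = { .dst_port = FAST_PATH_PORT, .tcp_flags = 0,
                         .seq = 1000, .payload_len = 64 };
        nic_receive(&p);
        return 0;
    }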

Chapter 4

Splintering TCP Acknowledgments to Decrease Latency

Offloading TCP acknowledgment functionality allows the network interface card to acknowledge packets as soon as they arrive. Normally, acknowledgments are made at the kernel level after packet processing takes place and data is sent to the application. The OS checks the payload in user space for errors before sending an ACK to the originating machine. This sequence of events increases acknowledgment latency substantially. Splintering TCP acknowledgments reduces application-to-application communication delay. Acknowledging packets at the NIC allows ACK messages to be received earlier at a peer node. In the context of small messages arriving frequently, low-latency ACKs are vital to keep nodes synchronized efficiently. Upon acknowledgment arrival, the OS can adjust the connection's TCP flow control window. Thus, the window is advanced more frequently and more data is sent. With enough optimization, the network becomes bandwidth-limited rather than latency-limited in terms of application-to-application communications speed. Figure 4.1 contrasts this strategy with the common TCP acknowledgment method.

[Figure 4.1: Normal (left) and splintered (right) TCP acknowledgment processes. In the existing path, the packet travels from the NIC through the operating system to the application before the ACK is sent; in the proposed path, the NIC sends the ACK and places the data directly in the application's receive buffer while only the header goes to the operating system.]

This optimization is highly interoperable between compute nodes. Splintering TCP acknowledgments on one node is transparent to all other nodes. The TCP and IP protocols do not appear modified from outside the given node. Thus, splintering is not required for all nodes in a system. Communications performance can be improved if only one host in a pair of connected nodes has a splintered TCP stack. Splintering TCP acknowledgments is a receive-side optimization. The sender must still handle incoming acknowledgments explicitly in the operating system because the NIC does not handle TCP control messages.

4.1 Error checking

TCP error checking is splintered and delegated to the application. An application can checksum while reading data from the receive buffer. Alternatively, the application can choose to avoid error checking entirely if desired. This strategy is useful for real-time applications in which checksumming adds unacceptable delay. Moving error checking to the application level reduces TCP to a protocol that is no longer reliable between processes. Applications that need this assurance must provide other reliability mechanisms.
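As a sketch of the checksum-while-reading idea, an application could run the standard Internet checksum (RFC 1071) over the payload as it consumes data from its receive buffer. This is illustrative code, not part of the thesis implementation, and a full TCP checksum would also cover the TCP header and pseudo-header rather than just the payload shown here.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    /* RFC 1071 ones-complement checksum over a buffer. */
    static uint16_t internet_checksum(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t sum = 0;

        while (len > 1) {                        /* add up 16-bit words */
            sum += ((uint32_t)p[0] << 8) | p[1];
            p += 2;
            len -= 2;
        }
        if (len == 1)                            /* pad a trailing odd byte */
            sum += (uint32_t)p[0] << 8;
        while (sum >> 16)                        /* fold carries back into 16 bits */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

    int main(void)
    {
        char payload[10];
        memset(payload, 'x', sizeof payload);    /* stand-in for data read from the receive buffer */
        printf("checksum = 0x%04x\n", internet_checksum(payload, sizeof payload));
        return 0;
    }

If the computed value does not match the checksum carried in the header, the application would have to request retransmission through some higher-level mechanism, as noted above.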

Removing error checking from the kernel protocol stack removes the need for each packet to interrupt the OS. Also, the message does not need to be copied to kernel space. This decreases host interrupt pressure and increases CPU availability. The NIC avoids involvement in error checking as well. Research has shown that error checking is too computationally intensive for NIC processors[5]; offloading the task to the NIC decreases system performance. Moreover, delegating error handling and correction to the NIC is extremely inefficient. The NIC is forced to handle a large array of conditions that occur infrequently and require much processing. Acknowledging packets at the NIC diminishes the reliability of TCP because ACKs are sent before message data is confirmed error-free. Thus the TCP protocol is modified and no longer adheres to its original node-to-node error-free transmission guarantee. However, this method reduces message latency, and errors may be handled through other mechanisms.

4.2 Assumptions

Following Amdahl's law, splintering focuses on a performance increase for the common case. For latency-sensitive, high-performance applications, this entails offloading operations that would allow small data messages to arrive at the application sooner. Small messages represent the bulk of traffic for such applications. Packet corruption on the PCI bus is handled by the application. Out-of-order packets are handled by the OS. This strategy relies on the fact that the vast majority of packets arrive error-free and in correct order. Optimization of these infrequent events would not yield a large performance gain, and would add undue complexity to the model. Splintering focuses on improving common case performance.

TCP control message arrival is not a common case, and is therefore handled by the operating system. The NIC sends connection and disconnection requests and other connection management messages directly to the OS. The network interface card must be able to recognize TCP connection and disconnection requests efficiently. Checking each message to see if it deals with connection control does not add an unacceptable amount of overhead and latency to message receipt and processing. Message headers are already checked to identify the connection to which an arriving packet belongs. If the packet does not pertain to the high-performance application that is being optimized, it is passed to the operating system. Thus checking control bits in the message header to determine message type would not add delay. The reliability of wired networks suggests that packets are rarely, if ever, corrupted in transport. If a message is corrupted on the wire between nodes, the NIC is likely to detect the error while performing the cyclic redundancy check (CRC). The packet is then dropped before an acknowledgment is sent. This preserves the standard TCP error control mechanism: if no ACK is received, the packet is retransmitted. The ACK is sent in error only if the packet payload is corrupted on the PCI bus while traveling from the NIC to user space. This rarely occurs. If an application needs complete end-to-end error checking, it can perform TCP checksumming on received payloads. If an error is detected, another protocol must be used to request retransmission. In splintering TCP acknowledgments, end-to-end reliability is sacrificed for latency reduction.

4.3 Implementation

My research involves building a proof-of-concept model showing reduced latency for acknowledgments made at the NIC level. To do this, I measured the amount of latency incurred as a small (64-byte) message passes from the NIC to an application. Comparative analysis demonstrates the value of sending ACKs at a lower level of the network infrastructure.

My model consists of one host acting as an echo client to send small packets to a second host. The second host echoes back these packets at the NIC and application levels. All messages in a batch are echoed off of only one level to prevent an additive increase in latency from the echo operation. Echo server functionality was added to the programmable NIC's firmware receive routine. A simple program acting as an echo server sends back packets at the application level. When echoing at the NIC, received messages are not passed up to higher layers. The machine acting as an echo client has unmodified NIC firmware and an unmodified network stack. Figure 4.2 shows echo locations.

[Figure 4.2: Echo server model. The echo client sends packets through its full application, OS, and NIC stack; the echo server returns them either from its NIC or from its application.]

The NIC echo process consists of simply swapping source and destination addresses in the packet header. Ethernet MAC and IP addresses undergo this swap. Source and destination port numbers in the transport layer header are switched as well. After these changes are made, the echo packet is enqueued in the NIC MAC engine to be sent. The DMA engine is not activated to send the packet to the OS as it normally would be. Application-level echoing does not involve explicitly modifying packet contents. The echo program receives a payload through a socket and then calls a user-level UDP send function to transmit the payload to the originating machine. Pseudocode for measurements is shown in Figure 4.3. 10-byte buffers represent 10-byte packet payloads that, when combined with Ethernet, IP, and UDP headers, yield a 64-byte packet. This process models sending TCP acknowledgments after the message has traveled to the given layer.

    process echo_client:
        char buffer[10];
        time_t latency[n];
        time_t t1, t2;

        for( int i = 0; i < n; i++ ) {
            t1 = gettime();
            send buffer;
            receive buffer;
            t2 = gettime();
            latency[i] = t2 - t1;
        }

    process NIC_echo:
        char buf[RECV_MSG_SIZE];
        byte tmp48[6];
        byte tmp32[4];
        byte tmp16[2];

        /* swap Ethernet MAC addresses */
        tmp48 = *(buf + src_MAC_addr);
        *(buf + src_MAC_addr) = *(buf + dst_MAC_addr);
        *(buf + dst_MAC_addr) = tmp48;

        /* swap IP addresses */
        tmp32 = *(buf + src_IP_addr);
        *(buf + src_IP_addr) = *(buf + dst_IP_addr);
        *(buf + dst_IP_addr) = tmp32;

        /* swap transport-layer ports */
        tmp16 = *(buf + src_port);
        *(buf + src_port) = *(buf + dst_port);
        *(buf + dst_port) = tmp16;

        set_frame_length();
        set_DMA_read_location();
        enqueue_MAC_transmission(&buf);

    process APP_echo:
        char buffer[10];

        while( 1 ) {
            receive buffer;
            send buffer;
        }

Figure 4.3: Pseudocode for echo testing.

Acknowledgments are sent directly from the NIC upon message receipt. At the application level, ACKs are sent from the TCP stack in the kernel as per normal protocol operation. However, the message must be received by the application before this occurs. Thus echoing messages at the application includes the latency incurred from the OS passing packet data to the application before it sends an ACK. In principle, sending an echo packet at each layer mimics sending an ACK after a packet or payload arrives at the given layer. Because this is a proof-of-concept model and not a simulation, several issues are ignored in implementation. A zero-copy mechanism is not implemented. The echo server application does not send buffer descriptors to the network interface card to notify it of impending message receipt. No pages are pinned for the NIC to transfer packet payloads to user space. When echoing at the NIC, message headers are not sent to the OS because packets do not travel past the NIC level.

In an application-level echo, the echo server does not perform error checking. However, error checking costs are included because checksumming is performed in the OS before the packet is passed up. Connection multiplexing is also ignored in this model.

4.4 Performance Testing

Latency measurements for an implementation of the proof-of-concept model were gathered using ping-pong testing through the echo model described in section 4.3. Data was gathered from the NIC and application levels. I ran the experiments using two Dell Precision WorkStation 620 MT machines with 933 MHz Pentium III processors. Each machine has 256 MB of RAM, a 64-bit, 66 MHz PCI bus, and an Alteon ACENIC Gigabit Ethernet adapter with a Tigon II chipset. The machines' network interfaces were directly connected; no switches or routers were present as intermediaries. Both systems ran the Linux 2.4.0 kernel and Red Hat 7.0. Each machine had a warmed cache and a static ARP table populated with the peer machine's data. Interrupt coalescing was turned off during testing. The echo server and client applications were the only explicit user processes running on the machines. Because this splintering model focuses on optimizing data messages rather than control messages, UDP packets can be used for testing even though the focus of this work is TCP. Between connection and disconnection, TCP data packets represent the majority of messages being sent. UDP and TCP data packets are nearly the same in format. Thus message latencies would be almost identical if TCP packets were used for testing and connection issues ignored. A simple user-level program acted as an echo client on the machine with unmodified protocol stacks and NIC firmware. UDP packets were sent through a datagram socket. These 64-byte messages traveled through the normal OS network stack to the echoing machine.

Latency was measured as the time elapsed between sending a message and receiving an echoed copy at the application. Iterations of 100, 1,000, and 5,000 messages were sent sequentially to get accurate latency measurements and to observe change in latency across time and message cluster size. Latency measurements from this testing model should be very similar to those in a real implementation of the splintered TCP strategy. Splintering TCP acknowledgments seeks to remove error checking and flow control overhead from the receive-side latency path. UDP does not involve checksumming (UDP checksumming is optional and was not enabled in these tests) or any reliability mechanisms. Thus this model simulates TCP acknowledgments being made at the NIC before any of these operations would take place, because the network interface card simply echoes the message at the same point that an ACK would be sent. However, echoing at the application level still involved IP error checking in these tests. Under the splintering strategy, where error checking and flow control window advancement are done after the ACK is sent, the latency measurements reported for these layers are therefore slightly high. Using splintering, IP checksums would take place after the echo. However, if the normal model is followed and reliability operations take place first, reported latencies are lower than real latencies, because TCP checksum and connection control operations would be performed before the echo is sent. Also, measurements may be slightly lower on other gigabit NICs. The Alteon NIC used for testing is known to have a slow DMA engine.
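For a concrete picture of the measurement loop, a minimal UDP ping-pong client along the lines described in this section could look like the sketch below. The port, iteration count, and payload size are placeholders; this is not the exact program used in these experiments.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define ITERATIONS 1000
    #define PAYLOAD    10            /* 10-byte payload -> minimum-size Ethernet frame */

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <server-ip> <port>\n", argv[0]);
            return 1;
        }

        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in peer;
        memset(&peer, 0, sizeof peer);
        peer.sin_family = AF_INET;
        peer.sin_port = htons((uint16_t)atoi(argv[2]));
        inet_pton(AF_INET, argv[1], &peer.sin_addr);

        char buf[PAYLOAD] = {0};
        struct timeval t1, t2;

        for (int i = 0; i < ITERATIONS; i++) {
            gettimeofday(&t1, NULL);
            sendto(fd, buf, sizeof buf, 0, (struct sockaddr *)&peer, sizeof peer);
            recvfrom(fd, buf, sizeof buf, 0, NULL, NULL);   /* blocks until the echo returns */
            gettimeofday(&t2, NULL);
            long usec = (t2.tv_sec - t1.tv_sec) * 1000000L + (t2.tv_usec - t1.tv_usec);
            printf("%d %ld\n", i, usec);                    /* message number, round-trip microseconds */
        }

        close(fd);
        return 0;
    }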

Chapter 5

Results

Tests showed a marked latency reduction for small messages echoed at the NIC level rather than at the OS or application. A 1,000-packet run gave an average of 195 μs for application echoes and 117 μs for NIC echoes. Echoing at the NIC decreased latency by roughly 40%. The standard deviation for NIC measurements was 3 μs and for application measurements, 54 μs. Figure 5.1 shows ping-pong latency measurements for 64-byte packets. The first packet echoed from both levels had higher-than-average latency. NIC startup costs may explain this delay. Iterations of 100, 1,000, and 5,000 packets gave nearly identical latency measurements. Calculated average latencies were slightly different due to the degree of influence of the preliminary high-latency packets. Figure 5.1 shows clusters of low-latency application echoes. Approximately one-third of these packets had round-trip times very close to NIC latency measurements. This suggests that the operating system was running during this time interval and able to accept incoming packets without a scheduling delay. Messages for the echo application interrupt the operating system, which is already running. The operating system schedules itself to run frequently because there is only one explicit user process running on the echo server system. Transferring a packet payload incurs a context switch into the application and a payload copy from kernel space to user space.

[Figure 5.1: Latency measurements for NIC and application echo (64-byte packets). Round-trip latency in microseconds is plotted against message number for the NIC and application echo paths.]

Latency measurements suggest that this context switch is minimal and does not involve flushing cache or TLB entries. For these reasons, a high percentage of application echoes incur almost as little latency as NIC echoes. When six spinning processes were added to the system on which the application echo server ran, the number of packets at the low latency level dropped by over 50%. The intricate relationship between the MAC and DMA engines in the Alteon NICs may help explain the minimal difference between the low application echo measurements and the NIC echo measurements. The Alteon NICs provide no strong mechanisms for controlling packet transmission. Packets are enqueued in the MAC engine, but other factors control when the packet is actually sent.

It appears that the DMA engine must be active for enqueued packets to leave the NIC. This posed a problem in the context of echoing packets at the NIC level without using DMA to transfer packet payloads to the OS. During application-level echoing, the DMA engine is constantly running. For NIC echoes, the DMA engine was coerced to allow the MAC engine to put echo packets on the wire. This implies that the DMA engine was not active during the NIC echo process and thus incurred an engine startup cost of roughly 5 μs on each packet sent. Thus the NIC measurements may be slightly high in this implementation because of Alteon hardware issues. Because the packet payload used in testing is so small (10 bytes), the time the OS spends copying this data into user space is negligible. Further testing with larger messages showed a larger gap between the low-level application echo latencies and the NIC echo latencies. Minimum latencies for application echoes went up 16 μs and 106 μs for 512-byte and 1 KB messages, respectively. This suggests that, for small messages, the value of a zero-copy mechanism lies only in their movement from the NIC to the OS. Possible scheduling and context switch costs incur high latency in this part of the message path. The OS-to-application copy incurs almost no latency for small packets. Using the higher level of application latency measurements as a baseline, tests show that echoing at the network interface card saves approximately 115 μs of delay. Amortized over the low and high application echo measurements reported here, latency decreases by roughly 78 μs. If the echo packet were a TCP acknowledgment message, a sender's TCP flow control window would be advanced tens of microseconds sooner. This latency reduction would occur the majority of the time on a loaded system because the operating system often is not running during message arrival. On a system running a single process, this performance gain may not occur as frequently. In either case, splintering TCP acknowledgments is a viable method to reduce application-to-application communications delay.
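To put this saving in perspective (a back-of-the-envelope figure, not a measurement from these experiments): at gigabit Ethernet speed, the roughly 78-115 μs by which an acknowledgment arrives earlier corresponds to

    1 Gb/s × 100 μs = 100,000 bits ≈ 12.5 KB

of wire capacity, so the sender's flow control window can release on the order of ten kilobytes of additional data that much sooner for each acknowledged segment.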

Chapter 6

Related Work

6.1 MPICH Application-Bypass

Buntinas, Panda, and Brightwell present an application-bypass strategy to reduce message latency, communications overhead, and CPU idle time[1]. Their research focuses on high-performance applications that frequently use MPICH broadcast operations over a Myrinet network. They note that process skew, or lack of synchronization, can lead to compute nodes wasting processor cycles simply waiting for other nodes to complete tasks. In a tree broadcast model, where nodes receive a message and forward it to their children, this idling causes latency to cascade and increase as the message spreads. Additional latency is incurred when the process receiving the message performs other tasks before calling the MPICH broadcast function to forward the message. In this application-bypass model, the MPICH library forwards a broadcast message as soon as it is received. By removing the application from this process and broadcasting at a lower level of the network stack, child processes receive messages faster. They do not wait on another application to make an explicit broadcast call. This application-bypass implementation decreases message latency even as processor skew increases. Results show up to a sixteen-fold improvement for MPICH broadcasts using application-bypass versus those without it.

Both MPICH application-bypass and splintering avoid traveling through upper layers of the network stack to reduce communications delay and overhead. However, MPICH application-bypass deals strictly with MPI, whereas splintering TCP acknowledgments does not pertain to any particular message transmission interface and deals with a more ubiquitous protocol. In the context of process skew, splintering could be modified to provide mechanisms for higher-level libraries to place synchronization routines on the network interface card.

6.2 Offloading RMPP RTS/CTS

Maccabe et al. experiment with offloading parts of the Reliable Message Passing Protocol (RMPP) to programmable gigabit Ethernet cards[11]. In particular, they offload message fragmentation and reassembly, zero-copy mechanisms, and Request to Send (RTS) and Clear to Send (CTS) message processing. Arriving data packets are normally copied into kernel space upon receipt. By offloading processing associated with message receipt and using a zero-copy strategy, communication overhead decreases. When the RMPP module sends a CTS packet to the requesting host, the RMPP library also sends the NIC the memory address at which it can place incoming packets. Offloading sender-side processing allows the NIC to respond to RMPP messages locally, rather than delegating this responsibility to the OS. When an initial RTS message is sent through the NIC, a message buffer descriptor is also sent, notifying the NIC of the memory location of data to be transmitted. The NIC can then access this memory directly to construct and send packets, without going through the OS. These send and receive optimizations allow more packets to be sent in a given period of time, increasing network bandwidth.

Experimenting with this implementation using Alteon ACENIC gigabit Ethernet cards yielded a bandwidth increase of over 50% and up to a 20% increase in CPU availability for messages of several hundred kilobytes. This approach deals with a specialized message passing protocol rather than a commodity protocol. It also focuses on increasing bandwidth and processor availability, rather than message latency. Offloading message fragmentation and reassembly focuses on large message transmission, as do RTS and CTS operations. Applications transmitting small messages do not need RTS/CTS dynamics or fragmentation handling and thus do not reap a performance gain.

6.3 Trapeze: Optimizations for TCP/IP over Gigabit Networks

Chase et al. propose a variety of optimizations for gigabit networks using commodity protocols. Their Trapeze messaging system seeks to maintain high network bandwidth, reduce CPU utilization at both the sender and receiver, and decrease communications latency[5]. They implement TCP/IP checksum offload, adaptive message pipelining, zero-copy data movement, and configurable MTUs. Scatter/gather DMA allows payloads to occupy multiple, noncontiguous page frames. Pipelining allows overlapping DMA transfers on the I/O bus and network to reduce large message latency. Experiments using these techniques over Myrinet and gigabit Ethernet yielded large increases in TCP bandwidth and moderate decreases in CPU utilization. UDP one-way message latency for 64-byte packets decreased by roughly 40%. IP packets are encapsulated in Trapeze messages. Protocol headers are located in the control message portion of a Trapeze packet and a payload is attached. Thus the implementation does not use purely commodity protocols.

The ability to configure large MTUs and adaptive message pipelining yields a performance gain only for large messages. It is only applicable to Myrinet networks, which have an unlimited frame size. Latency reduction comes only from zero-copy techniques, and the reported figures are for one-way transmission. Zero-copy decreases delay between the network interface card and the application, as it does in splintering. However, Trapeze ignores application-to-application round-trip message relay and acknowledgment.

6.4 Offloading IP Fragmentation and Reassembly

IP packet endpoint fragmentation and reassembly can be delegated to the NIC to reduce communications overhead. Gilfeather et al. demonstrate that this technique yields increased CPU utilization and reduced message latency[8]. This task is not too computationally intensive for a commodity programmable NIC, unlike IP error checking. Splintering IP in this manner uses available resources optimally. Messages larger than the transmission medium's MTU are allowed to pass through the network stack above the data link layer. The NIC fragments the messages to MTU size before transmission. When the NIC receives a fragmented message, it reassembles it locally, rather than delegating the task to the OS. The entire packet then travels up the network stack. Offloaded fragmentation and reassembly is transparent to the application. Because messages are fragmented at a lower level in the network stack, simple modifications must be made to allow messages larger than the MTU at the higher levels. Unlike interrupt coalescing, this technique does not increase message latency. The functionality of the protocol is left unchanged; its tasks simply take place on different hardware. Offloading this part of the IP protocol decreases CPU communications processing by about 50% while increasing processor effective utilization nearly twofold for large messages. This technique is an optimization for large messages. Small messages are not fragmented or reassembled at any layer of the network stack.

Thus offloading this operation does not decrease latency or optimize any other communication parameters for small message transfer.

Chapter 7

Conclusions

Splintering takes advantage of available resources at an optimal level. By allowing the operating system to retain control of resources, memory and connections are managed efficiently. Moreover, the operating system sees a reduction in communications overhead. Parts of the protocol stack are offloaded to the application and the network interface card, but neither is overwhelmed by the tasks it inherits. Splintering TCP acknowledgments yields reduced round-trip communications delay between applications in a high-performance computing environment. Although end-to-end reliability is sacrificed, the process of sending TCP acknowledgments no longer resides in the latency path. Thus there is no per-packet delay cost for sending ACKs. Also, bypassing the operating system on message receipt decreases communications overhead. This increases processor availability and allows the CPU to devote more cycles to computation. These modifications adapt commodity TCP to a parallel computing environment. Using splintering and other optimizing techniques, TCP may become a viable protocol for high-performance computing.

References

[1] D. Buntinas, D. K. Panda, and R. Brightwell. Application-bypass broadcast in MPICH over GM. In Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003), 2003. To appear.

[2] J. Chase, A. Gallatin, and K. Yocum. End system optimizations for high-speed TCP. IEEE Communications Magazine, 39(4):68-74, 2001.

[3] D. Culler, R. Karp, D. Patterson, A. Sahay, K. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a realistic model of parallel computation. In Proceedings of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 1-12, 1993.

[4] R. Fatoohi and S. Weeratunga. Performance evaluation of three distributed computing environments for scientific applications. In Proceedings of Supercomputing '94, pages 400-409, Washington, DC, November 1994.

[5] A. Gallatin, J. Chase, and K. Yocum. Trapeze/IP: TCP/IP at near-gigabit speeds. In Proceedings of the USENIX '99 Technical Conference, pages 109-120, June 1999.

[6] Network & communications: Gigabit Ethernet. Web, April 2003. http://google.cnet.com/shopping/0-11623-301-0-0.html?tag=stbc.gp.

[7] P. Gilfeather and A. Maccabe. Making TCP viable as a high performance computing protocol. In Proceedings of the Los Alamos Computer Science Institute (LACSI) Symposium, 2002.

[8] P. Gilfeather, A. Maccabe, and T. Underwood. Increasing performance in commodity IP. In Proceedings of the Grace Hopper Celebration of Women in Computing, 2002.

[9] P. D. Haynes and M. Côté. Parallel fast Fourier transforms for electronic structure calculations. Computer Physics Communications, 129:130-136, 2000.

[10] E. León. An MPI tool to measure application sensitivity to variation in communication parameters. Technical Report TR-CS-2003-20, University of New Mexico, 2003.

[11] A. Maccabe, W. Zhu, J. Otto, and R. Riesen. Experience in offloading protocol processing to a programmable NIC. In SC2002 High Performance Networking and Computing, Baltimore, MD, November 2002.

[12] R. P. Martin, A. Vahdat, D. E. Culler, and T. E. Anderson. Effects of communication latency, overhead, and bandwidth in a cluster architecture. In ISCA, pages 85-97, 1997.

[13] Myrinet product list. Web, April 2003. http://www.myri.com/myrinet/product_list.html.

[14] P. Shivam, P. Wyckoff, and D. K. Panda. EMP: Zero-copy OS-bypass NIC-driven Gigabit Ethernet message passing. In Proceedings of SC2001, 2001.