An Extensible Message-Oriented Offload Model for High-Performance Applications

Patricia Gilfeather and Arthur B. Maccabe
Scalable Systems Lab
Department of Computer Science
University of New Mexico
pfeather@cs.unm.edu, maccabe@cs.unm.edu
Los Alamos Computer Science Institute SC R717H-2921

Abstract

In this paper, we present and validate a new model that captures the benefits of protocol offload in the context of high-performance computing systems. In contrast to the LAWS model, the extensible message-oriented offload model (EMO) emphasizes communication in terms of messages rather than flows. In contrast to the LogP model, EMO emphasizes the performance of the network protocol rather than the parallel algorithm. The extensible message-oriented offload model allows us to consider benefits associated with the reduction in message latency along with benefits associated with reduction in overhead and improvements to throughput. We show how our model can be mapped to the LAWS and LogP models, and we present preliminary results to verify the model.

1 Introduction

Network speeds are increasing. Both Ethernet and InfiniBand are currently promising 4 Gb/s performance, and Gigabit performance is now commonplace. Offloading all or portions of communication protocol processing to an intelligent NIC (Network Interface Card) is frequently used to ensure that the benefits of these technologies are available to applications. However, determining which portions of a protocol to offload is still more of an art than a science. Furthermore, there are few tools to help protocol designers choose appropriate functionality to offload.

Shivam and Chase created the LAWS model to study the benefits and tradeoffs of offloading [2]. However, there are no models that address the specific concerns of high-performance computing. We create a model that explores offloading of commodity protocols for individual messages, which allows us to consider offloading performance for message-oriented applications and libraries like MPI.

In this paper, we provide a new model, the extensible message-oriented offload model (EMO), that allows us to evaluate and compare the performance of network protocols in a message-oriented, offloaded environment. First, we give an overview of two popular performance models, LAWS and LogP. Second, we review the characteristics of high-performance computing and show how the current models do not meet the specific needs of modeling for high-end applications. Third, we introduce EMO, a language for capturing the performance of various offload strategies for message-oriented protocols. We explain a model of overhead and latency using EMO and map EMO onto LAWS. Finally, we present our preliminary results in verifying the model by comparing modeled latencies for interrupt coalescing with actual results.

2 Previous Models of Communication Performance

There are two performance models we considered before creating one that was specific to the needs of high-performance computing.

2.1 LogP

LogP was created as a new model of parallel computation to replace the outdated data-oriented PRAM model. It is based on the following four parameters.

L - upper bound on latency
o - protocol processing overhead
g - minimum interval between message sends or message receives
P - number of processors

LogP is message-oriented. The original model assumes short messages, but several extensions to LogP for large message sizes have been proposed. However, its focus is on the execution of the entire parallel algorithm and not on the performance of the particular network protocol. The LogP model gives us the understanding that the overhead and gap of communications must be minimized to increase the performance of parallel algorithms. It does not, however, give us any insight into how this may be done.

2.2 LAWS

The LAWS model was created to begin to quantify the debate over offloading the TCP/IP protocol. LAWS attempts to characterize the benefits of transport offload. It is based on the following four ratios.

Lag ratio (α) - the ratio of host processing speed to NIC processing speed
Application ratio (γ) - the ratio of application processing to communication processing, i.e., how much CPU the application needs
Wire ratio (σ) - the ratio of the bandwidth the host can deliver when fully occupied with communication processing to the raw network bandwidth
Structural ratio (β) - the ratio of the overhead for communication with offload to the overhead without offload, i.e., what processing was eliminated by offload

The LAWS model effectively captures the benefits and constraints of protocol processing offload. Furthermore, because the ratios are independent of a particular protocol, LAWS is extensible. When extending LAWS to an application-level library, the application ratio (γ) must reflect the additional overhead associated with moving data and control from the operating system (OS) to the application library, but this is trivial. However, LAWS is stream-oriented and not message-oriented. Specifically, it cannot help us to understand how to minimize gap or latency, which are primary needs for our model.

LAWS is a good model of the behavior of offloading transport protocols. We provide a mapping from our high-performance message-oriented model to the LAWS model in Section 3.3 so we may benefit from the understanding that the LAWS model brings to the question of how and when to offload.
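To make the four ratios concrete, the short sketch below computes them directly from the verbal definitions above. It is only an illustration: the variable names and sample numbers are assumptions of ours, not values from the LAWS paper, and the wire-ratio expression follows the reconstruction given in the list.

```python
def laws_ratios(host_speed, nic_speed,
                app_cycles_per_msg, comm_cycles_per_msg,
                host_saturated_bw, link_bw,
                overhead_with_offload, overhead_without_offload):
    """Compute the four LAWS ratios from their definitions (illustrative only)."""
    alpha = host_speed / nic_speed                            # lag ratio
    gamma = app_cycles_per_msg / comm_cycles_per_msg          # application ratio
    sigma = host_saturated_bw / link_bw                       # wire ratio (as reconstructed above)
    beta = overhead_with_offload / overhead_without_offload   # structural ratio
    return {"alpha": alpha, "gamma": gamma, "sigma": sigma, "beta": beta}

# Hypothetical numbers: a 933 MHz host with an 88 MHz NIC, an application that
# needs twice the cycles of the protocol stack, a host that can drive 600 Mb/s
# of a 1 Gb/s link, and an offload scheme that removes 60% of the host overhead.
print(laws_ratios(933e6, 88e6, 2000, 1000, 600e6, 1e9, 400, 1000))
```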

3 Extensible Message-Oriented Offload Model

Neither the LAWS model nor the LogP model helps us to evaluate methods for offloading network protocols in the high-performance computing environment. LAWS is not message-oriented, so it does not allow us to model either gap or latency. LogP is not specifically oriented to network protocol performance. We needed a new model of communication in high-performance computing.

3.1 Requirements for a High-Performance Model

We wanted to create a simple language for modeling methods of offload in order to understand how they relate to high-end applications. In addition to the ability to model latency, gap and overhead, we had three requirements for our performance model.

We wanted the model to extend through all layers of a network protocol stack, including message libraries like MPI at the application layer.
We wanted to model offload onto a NIC, as this was our primary focus.
We wanted to model behavior in a message-oriented environment.

3.1.1 Extensible

Extensibility is necessary for our model because network protocols are often layered. Layered above the network protocols are more layers of message-passing APIs and languages like MPI and LINDA. We developed our model to extend through the layers of network protocols and message-passing APIs.

For example, one of the reasons that TCP has not been considered competitive in high-performance computing is that the MPI implementations are not efficient. The MPI implementations over TCP are generally not well integrated into the TCP protocol stack. A zero-copy TCP implementation still requires a copy in application space as the MPI headers are stripped, the MPI message is matched and the MPI data is moved to the appropriate application buffer. A zero-copy implementation of MPI will require a way to strip headers and perform the MPI match at the NIC level. Again, application libraries like LINDA are implemented on top of MPI. The same process will continue through all layers of the communication stack. We want our model to be extensible so we can capture this behavior.

3.1.2 Offload

Offloading part or all of the processing of a protocol in order to decrease overhead has been commonplace for years. In the commodity markets of Internet serving and file serving, TCP offload engines (TOEs) are becoming more common as they attempt to compete with other networks like Fibre Channel. In high-performance computing, Myrinet, VIA, Quadrics and InfiniBand all do some or all of their protocol processing on a NIC or on the network itself. Offload is an attractive way to keep overheads on the host low. Unfortunately, all of these networking solutions are either unavailable, not scalable, or very expensive.

Our goal in producing this model was to provide a way to explore whether smart offloading of a commodity protocol like IP or TCP could eventually make these protocols competitive in the high-end computing arena. However, we have also found this model useful in developing strategies for offloading various protocols.

Offloading is the central focus of this model. Like the LAWS model, the goal of this model is to explore the benefits of offloading transport protocol processing. Unlike the LAWS model, we are doing so in the context of message-oriented high-performance applications.

3.1.3 Message-Oriented

We need a performance model that is message-oriented so that we can specifically model and compare methods of offloading that decrease overhead or latency. Because we assume a low-loss network with application-level flow control, we choose to focus exclusively on the protocol behavior of the sending and receiving hosts. To this end, we assume the network does not affect either the overhead of a message or its latency. Clearly, the network affects latency, but we are concerned with how host protocol processing affects latency.

The message-oriented nature of the model also provides the structure necessary to model gap in a new way. The message-oriented nature of the model, along with the emphasis on the communication patterns of a single host, allows us to focus on the benefits of offloading protocol processing specifically as a measure of overhead and gap.

3.2 EMO

We wanted a performance model that is not specific to any one protocol, but our choices were informed by our understanding of MPI over TCP over IP. The Extensible Message-oriented Offload model (EMO) is shown in Figure 1. The latency and overhead required to communicate between components must include the movement of data when appropriate. The variables for this model are as follows:

C_N - number of cycles of protocol processing on the NIC
R_N - rate of the CPU on the NIC
L_NH - time to move data and control from the NIC to the host OS
C_H - number of cycles of protocol processing on the host
R_H - rate of the CPU on the host
L_HA - time to move data and control from the host OS to the application
L_NA - time to move data and control from the NIC to the application
C_A - number of cycles of protocol processing at the application
O_NH - number of host cycles to move data and control from the NIC to the host OS
O_HA - number of host cycles to move data and control from the host OS to the application
O_NA - number of host cycles necessary to communicate and move data from the NIC to the application

Figure 1. The Extensible Message-oriented Offload Model. The figure shows the NIC (CPU rate R_N, protocol overhead C_N), the host OS (CPU rate R_H, protocol overhead C_H) and the application (protocol overhead C_A), with the paths between them labeled by the latencies L_NH, L_HA, L_NA and the overheads O_NH, O_HA, O_NA.

3.2.1 Extensibility

The model allows for extensibility with respect to protocol layers. We hope this model can be useful for researchers working on offloading parts of the MPI library (like the MPI match) or parts of the matching mechanisms for any language or API. We constructed the model so that it can grow through levels of protocols. For example, our model can be extended, or telescoped, to include offloading portions of MPI: we simply add C_M, L_AM and O_AM to the equations for overhead and latency.
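As a concrete rendering of the variable list and Figure 1, the sketch below simply collects the EMO parameters into one structure (Python is our choice here; the paper itself defines no code). The optional C_M, L_AM and O_AM fields correspond to the telescoped MPI layer mentioned above.

```python
from dataclasses import dataclass

@dataclass
class EMOParams:
    """Per-message EMO parameters: C_* in cycles, R_* in cycles/s, L_* in seconds."""
    C_N: float   # cycles of protocol processing on the NIC
    R_N: float   # rate of the CPU on the NIC
    L_NH: float  # time to move data and control from the NIC to the host OS
    C_H: float   # cycles of protocol processing on the host
    R_H: float   # rate of the CPU on the host
    L_HA: float  # time to move data and control from the host OS to the application
    L_NA: float  # time to move data and control from the NIC to the application
    C_A: float   # cycles of protocol processing at the application
    O_NH: float  # host cycles to move data and control from the NIC to the host OS
    O_HA: float  # host cycles to move data and control from the host OS to the application
    O_NA: float  # host cycles to move data and control from the NIC to the application
    # Optional telescoped MPI layer (Section 3.2.1)
    C_M: float = 0.0
    L_AM: float = 0.0
    O_AM: float = 0.0
```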

3.2.2 Overhead

EMO allows us to explore the fundamental cost of any protocol, its overhead. Overhead occurs at the per-message and per-byte level. Our model allows us to estimate and graphically represent our understanding of overhead for various levels of protocol offload. Overhead is modeled as

Overhead = O_NH + O_HA + C_A + O_NA

However, any given method will use only some of these communication paths to process the protocol. A traditional protocol stack, for example, does not use the communication path between the NIC and the application and does no processing at the application:

Traditional Overhead = O_NH + O_HA

3.2.3 Gap

Gap is the interarrival time of messages to an application on a receive and the interdeparture time of messages from an application on a send. It is a measure of how well-pipelined the network protocol stack is. But gap is also a measure of how well-balanced the system is. If the host processor is processing packets for a receive very quickly but the NIC cannot keep up, the host processor will starve and the gap will increase. If the host processor is not able to process packets quickly enough on a receive, the NIC will starve and the gap will increase. If the network is slow, both the NIC and the host will starve. Gap is a measure of how well-balanced the system is; as we minimize gap, we balance the system.

3.2.4 Latency

Latency is modeled as

Latency = L_NH + L_HA + L_NA + C_A

However, any given method will use only some of these communication paths to process the protocol. Traditional network protocols, for example, do not use the communication path between the NIC and the application and do no processing at the application:

Traditional Latency = L_NH + L_HA
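The sums above are easy to mechanize. The sketch below evaluates the full and traditional forms of overhead and latency for an EMOParams instance (defined earlier); the conversion of C_A from cycles to time via R_H, and the max-of-three reading of gap, are our assumptions rather than equations taken from the paper.

```python
def overhead(p, terms=("O_NH", "O_HA", "C_A", "O_NA")):
    """Sum the overhead terms (in host cycles) used by a given offload strategy.
    A traditional protocol stack would pass terms=("O_NH", "O_HA")."""
    return sum(getattr(p, t) for t in terms)

def latency(p, terms=("L_NH", "L_HA", "L_NA"), include_app_processing=True):
    """Sum the latency terms (in seconds) used by a given offload strategy.
    C_A is converted to time using the host CPU rate R_H (an assumption made
    here for dimensional consistency). A traditional stack would pass
    terms=("L_NH", "L_HA"), include_app_processing=False."""
    total = sum(getattr(p, name) for name in terms)
    if include_app_processing:
        total += p.C_A / p.R_H
    return total

def gap(p, wire_time):
    """One plausible reading of Section 3.2.3 (our assumption, not the paper's
    equation): the gap is set by the slowest of NIC processing, host
    processing and the wire."""
    return max(p.C_N / p.R_N, p.C_H / p.R_H, wire_time)
```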

3.3 Mapping EMO onto LAWS

EMO can be mapped directly onto LAWS, which is useful for placing the model in the context of the larger offload community. Because LAWS concentrates on an arbitrary number of bytes in a specified amount of time and EMO concentrates on an arbitrary amount of time for a specified number of bytes, we will have to make a few assumptions. The parameters that make up the ratios in the LAWS model are below.

o - CPU occupancy for communication overhead per unit of bandwidth
a - CPU occupancy for the application per unit of bandwidth
X - occupancy scale factor for host processing
Y - occupancy scale factor for NIC processing
p - portion of the communication overhead o offloaded to the NIC
B - bandwidth of the network path

LAWS assumes a fixed amount of time. Let the fixed amount of time be equal to the time to receive a message of length N on a host; we'll call this time T. This allows us to determine the total number of host cycles possible:

C_T = R_H · T

For simplicity, let's define an overhead total for EMO:

O_T = O_NH + O_HA + C_A + O_NA

Now that we have a fixed time T, a fixed number of bytes N, and the total number of host cycles C_T, we can map EMO onto the LAWS parameters. The most difficult part of the mapping from EMO to LAWS is the fact that the communication overhead o is constant while the percentage offloaded, p, is variable. Thus, p is a ratio used to compare two different offload schemes. Our offload schemes are modeled with different values for C_H and C_N to reflect this difference. We use C_H' to represent the amount of host protocol processing moved to the NIC under a second offload scheme, and we assume that C_N is incremented by C_H', since this is the assumption of the LAWS model. Changes to the actual amount of protocol processing under various offload schemes are reflected in the LAWS ratio β; changes to the amount of protocol processing done under various offload schemes are reflected directly in different variable values in EMO.

o = O_T = O_NH + O_HA + C_A + O_NA
a = C_T - o
X = 1 / R_H
Y = 1 / R_N
p = C_H' / o
B = N / T

LAWS derives all of its ratios from these parameters with the exception of β. The structural ratio describes the amount of processing saved by using a particular offload scheme. We can quantify this directly from our model, assuming the second offload mechanism is denoted by primed variables:

β = (C_N' + O_NH' + C_H' + O_HA' + C_A' + O_NA') / (C_N + O_NH + C_H + O_HA + C_A + O_NA)

Now we have all of the necessary elements to map EMO onto LAWS. This is useful for understanding how the EMO model fits into the larger area of offloading of protocol processing in commodity applications. We created EMO for high-end computing so we can explore gap and overhead for message-oriented applications.
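The mapping can be written down mechanically. The sketch below follows the fixed-time construction above (T is the time to receive an N-byte message, so C_T = R_H · T); the expressions for X, Y and p mirror the reconstruction in the text and should be read as a sketch of the mapping rather than a definitive implementation.

```python
def emo_to_laws(base, primed, N, T):
    """Map a baseline EMOParams and a second ("primed") offload scheme onto the
    LAWS parameters for an N-byte message received in time T (Section 3.3).
    The X, Y and p expressions follow our reconstruction of the mapping."""
    C_T = base.R_H * T                                  # total host cycles available
    o = base.O_NH + base.O_HA + base.C_A + base.O_NA    # communication overhead (cycles)
    a = C_T - o                                         # cycles left for the application
    X = 1.0 / base.R_H                                  # host occupancy scale factor
    Y = 1.0 / base.R_N                                  # NIC occupancy scale factor
    p = primed.C_H / o if o else 0.0                    # portion of overhead offloaded (C_H' / o)
    B = N / T                                           # bandwidth of the network path
    work = base.C_N + base.O_NH + base.C_H + base.O_HA + base.C_A + base.O_NA
    work_primed = (primed.C_N + primed.O_NH + primed.C_H +
                   primed.O_HA + primed.C_A + primed.O_NA)
    beta = work_primed / work                           # structural ratio
    return {"o": o, "a": a, "X": X, "Y": Y, "p": p, "B": B, "beta": beta}
```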

4 Model Verification - Initial Results

We created EMO as a language for comparing methods of offload, but it can be considered an analytical model as well. Our initial verification concerned interrupt coalescing. We verified our model for the latency of UDP with no interrupt coalescing and with default interrupt coalescing.

We measured latencies by creating a ping-pong test between Host A and Host B. Host A remains constant throughout the measurements. Host A is a 933 MHz Pentium III running an unmodified Linux 2.4.25 kernel, with the Acenic Gigabit Ethernet card set to default values for interrupt coalescing and transmit ratios. Host B is an identical machine, connected to Host A by cross-over fiber. Host B also runs an unmodified Linux 2.4.25 kernel.

We measured overhead by modifying our ping-pong test. Host A continues the ping-pong test, but Host B includes a cycle-soaker that counts the number of cycles that can be completed while communication is in progress.

4.1 Latency

In order to verify the model for latency, we measured actual latency and approximated measurements for the various parts of the sum in our equation:

Traditional Latency = L_NH + L_HA

Our model is verified to the extent that the sum of the addends approximates the actual measured latency.

4.1.1 Application to Application Latency

In order to measure the traditional latency, we ran a simple UDP echo server in user space on Host B. Host A simply measures ping-pong latency for various message sizes. We measured this latency from 1-byte messages through 8900-byte messages. We wanted to remain within the jumbo frame size to avoid fragmentation and reassembly or multiple packets, but we also wanted to exercise the crossing of page boundaries. The page size for the Linux 2.4.25 kernel is 4 KB. We measured application to application latency when Host B has default interrupt coalescing parameters set and also when Host B has interrupt coalescing turned off.

4.1.2 Application to NIC Latency

In order to measure application to NIC latency, we moved the UDP echo server into the Acenic firmware. This allows us to measure the latency of a message as it travels through Host A, across the wire, and to the NIC on Host B. This latency should not reflect the cost of the interrupt on Host B, the cost of moving through the kernel receive or send paths on Host B, nor the cost of the copy of data into user space on Host B. The UDP echo server exercises all of the code in the traditional UDP receive and send paths in the Acenic firmware with the exception of the DMA to the host, so we assume that the application to NIC latency measurement includes the NIC processing portion of the latency calculation. However, it is important to note that the startup time for the DMA engine on the Acenic cards is approximately 5 μs. This will be accounted for in the L_NH portion of the calculation.

4.1.3 Application to Kernel Latency

For our initial results, we chose to measure the latency between an application on Host A and the kernel on Host B using the ping utility already provided by the kernel. This was an easy measurement to procure and should give a reasonable approximation of the latency between the application and the kernel. While the ping message does not travel exactly the same code path as a UDP message in the kernel, it does exercise the same IP path and very similar code at the level above IP. The ICMP (ping) message does not perform a copy of data to user space and does not perform a route lookup. Unfortunately for our purposes, the ping utility does perform a checksum, so the application to kernel measurements will be linearly dependent on the size of the message. This is a major difference. In the future, we would like to add the echo server into the kernel so as to be as consistent as possible in our calculations.

The application to kernel latency was measured with an unmodified Host A with default interrupt coalescing, and with Host B with both default interrupt coalescing and no interrupt coalescing. We expect that the differences between interrupt coalescing and no interrupt coalescing should be present at this level.
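The application to application measurement of Section 4.1.1 is a plain UDP ping-pong. The sketch below is a minimal stand-in for that test, not the code used in the paper: the port, message size and iteration count are arbitrary, and timing is done with Python's wall-clock timer rather than CPU cycle counters.

```python
import socket
import time

def udp_echo_server(port=9000):
    """Run on Host B: echo every datagram back to its sender."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    while True:
        data, addr = sock.recvfrom(65535)
        sock.sendto(data, addr)

def pingpong_latency(host, port=9000, size=1024, iters=1000):
    """Run on Host A: return the average one-way latency in microseconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.connect((host, port))
    msg = b"x" * size
    start = time.perf_counter()
    for _ in range(iters):
        sock.send(msg)
        sock.recv(65535)
    elapsed = time.perf_counter() - start
    return elapsed / iters / 2.0 * 1e6
```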

4.1.4 Results

The expectation is that the average latency will generally be smaller when there is no interrupt coalescing. This is shown in our model. Interrupt coalescing can be seen as a move to decrease the overhead effect of the interrupt, O_NH, at the cost of the time before an interrupt reaches the host. This means we expect that L_NH is the variable affected.

We cannot currently fully verify this result. We have no reasonable way to isolate the processing on the NIC and have not fully measured processing time or overhead in the kernel. However, Figure 2 shows that latency is generally slightly lower when Host B disables interrupt coalescing.

Figure 2. Application to Application Latency (average latency in microseconds, default versus no interrupt coalescing on Host B).

If we let X be the latency of all communication on Host A, the wire and the NIC on Host B, our actual latency can be represented as X + L_NH + L_HA. Moreover, Figure 3 further isolates this phenomenon by approximating the ping-pong measurement without the final move from kernel to application space: Figure 3 represents a comparison between X + L_NH with interrupt coalescing disabled and X + L_NH with default interrupt coalescing, where X represents the latency of all communication on Host A and the wire.

Figure 3. Application to Kernel Latency (average latency in microseconds, default versus no interrupt coalescing on Host B).

Our original intent was to use the application to NIC measurements to isolate X and L_NH. However, Figure 4 shows clearly that our application to NIC measurements are not useful. Because the slope is greater than the slope of the actual latency, it appears that our UDP echo server on the NIC is touching the data more often than the traditional application echo server. At the very least, the UDP echo server is clearly touching the data, and the speed of the NIC (88 MHz) exacerbates the slowdown. We expect this to be a temporary problem, as there is no design issue that precludes returning the data without touching it; this is simply an implementation flaw.

Figure 4. Latency with No Interrupt Coalescing (average latency in microseconds for the actual, to-kernel and to-NIC measurements).

Figure 4 also presents us with another artifact dissimilar from the model. From our preliminary discussions of EMO, we assumed that L_HA is linear with the size of the message because the copy of data occurs here. However, the application to kernel route is clearly also touching all of the data. This is presumably because the kernel is checksumming the IP packets and therefore touching the data as well. Clearly, the UDP echo server on the NIC must be made useful, and the application to kernel path cannot include the checksum, before we can fully verify our model. However, the insights provided by the model for interrupt coalescing are verified by the measurements.

4.2 Overhead

In order to verify the model for overhead, we measured actual overhead and approximated measurements for the various parts of the sum in our equation:

Traditional Overhead = O_NH + O_HA

Our model is verified to the extent that the sum of the addends approximates the actual measured overhead.

4.2.1 Application to Application Overhead

We measured the amount of cycle-soak work Host B can do without any communication occurring. Then we measured the amount of cycle-soak work Host B can do with standard ping-pong communication of various sized messages occurring between an application on Host A and a UDP echo server on Host B. The difference between these two amounts of work is the overhead associated with the communication; it is the number of cycles taken away from computation. We measured the overhead of application to application communication with default interrupt coalescing on Host A, and on Host B with both default interrupt coalescing and with interrupt coalescing disabled. We expect that the overhead of application to application communication when Host B is using interrupt coalescing will be lower than when Host B is not using interrupt coalescing.

4.2.2 Kernel to Application Overhead

In order to measure the overhead for kernel to application communication, Host A ran a ping flood against Host B while Host B ran the cycle-soak work calculation. We expect that interrupt coalescing will still make a difference at this level of communication, so that Host B with no interrupt coalescing will have higher overhead than Host B with default interrupt coalescing. However, we do not expect the size of the message to make as much of a difference in the communication overhead at this level as it does at the application to application level.

4.2.3 NIC to Application Overhead

In order to measure the overhead for NIC to application communication, Host B is run with the modified Acenic firmware with the UDP echo server at the NIC level. Host A runs the UDP ping-pong test and Host B runs the cycle-soak work calculation. We expect quite low overhead on Host B, as there is no host involvement in the communication and therefore no communication overhead.
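The overhead measurements of Sections 4.2.1 and 4.2.2 subtract the cycle-soak work completed during communication from the work completed while idle. The sketch below shows that bookkeeping in miniature; it counts loop iterations over a fixed wall-clock window rather than raw CPU cycles, and the window length is an arbitrary choice of ours.

```python
import time

def cycle_soak(seconds=5.0):
    """Count how many iterations of busy work complete in a fixed window.
    Fewer iterations during a run means more CPU was taken by communication."""
    count = 0
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        count += 1
    return count

def communication_overhead(seconds=5.0):
    """Overhead = idle work minus the work achievable while the ping-pong test
    (or ping flood) from Host A is running against this host."""
    idle = cycle_soak(seconds)
    input("Start the ping-pong test from Host A, then press Enter...")
    busy = cycle_soak(seconds)
    return idle - busy
```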

4.2.4 Results

As expected, there was no communication overhead when the UDP echo server runs at the NIC level. This verified our modeled expectations. We expected that the overhead for application to application communication would be lower when Host B employed interrupt coalescing. Figure 5 shows that the difference is negligible at best, but this reflects general results regarding interrupt coalescing and its efficacy in lowering overhead [1]. Moreover, Figure 6 shows that this effect also follows through the kernel to application communication path, as expected, since the interrupt is still present in this path.

Figure 5. Application to Application Overheads (average overhead in cycles, default versus no interrupt coalescing on Host B).

Figure 6. Kernel to Application Overheads (average overhead in cycles, default versus no interrupt coalescing on Host B).

The most interesting result, however, does not verify our assumptions regarding the EMO model. The overhead for application to application communication does not increase with the size of the message as we expected. Figure 7 shows that the gap that represents O_HA remains constant rather than increasing as the size of the message increases; the overall overhead also remains constant. Clearly, for small messages, the overhead is predominantly the interrupt overhead. Measurements for much larger messages should reveal overheads that begin to slope with the size of the message. These results should bring a clearer understanding of the role of the memory subsystem in EMO.

Figure 7. Overhead with Default Interrupt Coalescing (average overhead in cycles for the actual and to-kernel measurements).

5 Conclusions and Future Work

The extensible message-oriented offload model (EMO) allows us to explore the space of network protocol implementation from the application messaging layers through to the NIC on a message-by-message basis. This new language gives us a fresh understanding of the role of offloading in terms of overhead, latency and gap in high-performance systems.

The preliminary work on verification has yielded areas of further research. A more effective means of isolating the various parts of the message path must be created before verification can be complete. The next steps are a profiling of the kernel and a reimplementation of the UDP echo server at the NIC and in the kernel. Clearly, the model remains conceptual until all variables can be isolated and all assumptions can be verified.

EMO as a language for exploring offload design is already being used. We plan to explore the model of gap in EMO to bound the resource requirements for NICs or TCP offload engines at 10 Gb/s speeds. We also plan to extend EMO to include memory management considerations such as caching.

References

[1] P. Gilfeather and T. Underwood. Fragmentation and high performance IP. In Proceedings of the 15th International Parallel and Distributed Processing Symposium, April 2001.

[2] P. Shivam and J. Chase. On the elusive benefits of protocol offload. In SIGCOMM Workshop on Network-I/O Convergence: Experience, Lessons, Implications (NICELI), August 2003.