An Extensible Message-Oriented Offload Model for High-Performance Applications
An Extensible Message-Oriented Offload Model for High-Performance Applications

Patricia Gilfeather and Arthur B. Maccabe
Scalable Systems Lab
Department of Computer Science
University of New Mexico
(Los Alamos Computer Science Institute SC R717H)

Abstract

In this paper, we present and validate a new model that captures the benefits of protocol offload in the context of high performance computing systems. In contrast to the LAWS model, the extensible message-oriented offload model (EMO) emphasizes communication in terms of messages rather than flows. In contrast to the LogP model, EMO emphasizes the performance of the network protocol rather than the parallel algorithm. The extensible message-oriented offload model allows us to consider benefits associated with the reduction in message latency along with benefits associated with reduction in overhead and improvements to throughput. We show how our model can be mapped to the LAWS and LogP models, and we present preliminary results to verify the model.

1 Introduction

Network speeds are increasing. Both Ethernet and InfiniBand are currently promising 4 Gb/s performance, and Gigabit performance is now commonplace. Offloading all or portions of communication protocol processing to an intelligent NIC (Network Interface Card) is frequently used to ensure that the benefits of these technologies are available to applications. However, determining what portions of a protocol to offload is still more of an art than a science. Furthermore, there are few tools to help protocol designers choose appropriate functionality to offload. Shivam and Chase created the LAWS model to study the benefits and tradeoffs of offloading [2]. However, there are no models that address the specific concerns of high-performance computing. We create a model that explores offloading of commodity protocols for individual messages, which allows us to consider offloading performance for message-oriented applications and libraries like MPI.
In this paper, we provide a new model, the extensible message-oriented offload model (EMO), that allows us to evaluate and compare the performance of network protocols in a message-oriented offloaded environment. First, we overview two popular performance models, LAWS and LogP. Second, we review the characteristics of high-performance computing and show how the current models do not meet the specific needs of modeling for high-end applications. Third, we introduce EMO, which is a language for capturing the performance of various offload strategies for message-oriented protocols. We explain a model of overhead and latency using EMO and map EMO onto LAWS. Finally, we present our preliminary results in verifying the model by comparing modeled latencies for interrupt coalescing with actual results.

2 Previous Models of Communication Performance

There are two performance models we considered before creating one that was specific to the needs of high-performance computing.

2.1 LogP

LogP was created as a new model of parallel computation to replace the outdated data-oriented PRAM model. It is based on the following four parameters.

L - upper bound on latency
o - protocol processing overhead
g - minimum interval between message sends or message receives
P - number of processors

LogP is message-oriented. The original model assumes short messages, but several extensions to LogP for large message sizes have been proposed. However, its focus is on the execution of the entire parallel algorithm and not on the performance of the particular network protocol. The LogP model gives us the understanding that the overhead and gap of communications must be minimized to increase the performance of parallel algorithms. It does not, however, give us any insight into how this may be done.

2.2 LAWS

The LAWS model was created to begin to quantify the debate over offloading the TCP/IP protocol. LAWS attempts to characterize the benefits of transport offload. It is based on the following four ratios.

Lag ratio (α) - the ratio of host processing speed to NIC processing speed

Application ratio (γ) - the ratio of application processing to communication processing - how much CPU the application needs
Wire ratio (σ) - the ratio of the bandwidth the host can deliver when its CPU is fully consumed by communication processing to the raw network bandwidth

Structural ratio (β) - the ratio of overhead for communication with offload to overhead without offload - what processing was eliminated by offload

The LAWS model effectively captures the benefits and constraints of protocol processing offload. Furthermore, because the ratios are independent of a particular protocol, LAWS is extensible. When extending LAWS to an application-level library, the application ratio (γ) must reflect the additional overhead associated with moving data and control from the operating system (OS) to the application library, but this is trivial. However, LAWS is stream-oriented and not message-oriented. Specifically, it cannot help us to understand how to minimize gap or latency, which are primary needs for our model. LAWS is a good model of the behavior of offloading transport protocols. We provide a mapping from our high-performance message-oriented model to the LAWS model in Section 3.3 so we may benefit from the understanding that the LAWS model brings to the question of how and when to offload.
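The LogP accounting above can be made concrete with a small sketch. This is an illustrative reading of the model, not code from either paper; the parameter values and the particular time-accounting convention (sender pays o, consecutive sends spaced by max(o, g), receiver pays o) are assumptions for illustration.

```python
def logp_send_time(n_msgs: int, L: float, o: float, g: float) -> float:
    """Time for one processor to send n_msgs back-to-back short messages
    and for the last one to be absorbed at the receiver: the sender pays
    overhead o per message, successive sends are spaced by at least
    max(o, g), and the final message costs L on the wire plus o at the
    receiving end."""
    spacing = max(o, g)
    return o + (n_msgs - 1) * spacing + L + o
```

Lowering o shrinks both the per-message cost and, when o > g, the inter-send spacing, which is exactly the "minimize overhead and gap" insight the text draws from LogP.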
3 Extensible Message-Oriented Offload Model

Neither the LAWS model nor the LogP model helps us to evaluate methods for offloading network protocols in the high-performance computing environment. LAWS is not message-oriented, and so it does not allow us to model either gap or latency. LogP is not specifically oriented to network protocol performance. We needed a new model of communication in high-performance computing.

3.1 Requirements for High-Performance Model

We wanted to create a simple language for modeling methods of offload in order to understand how they relate to high-end applications. In addition to the ability to model latency, gap and overhead, we had three requirements for our performance model:

- We wanted the model to extend through all layers of a network protocol stack, including message libraries like MPI at the application layer.
- We wanted to model offload onto a NIC, as this was our primary focus.
- We wanted to model behavior in a message-oriented environment.

3.1.1 Extensible

Extensibility is necessary for our model because network protocols are often layered. Layered above the network protocols are more layers of message-passing APIs and languages like MPI and LINDA. We developed our model to extend through the layers of network protocols and message-passing APIs. For example, one of the reasons that TCP has not been considered competitive in high-performance computing is that the MPI implementations are not efficient. The MPI implementations over TCP are generally not well integrated into the TCP protocol stack. A zero-copy TCP implementation still requires a copy in application space as the MPI headers are stripped, the MPI message is matched and the MPI data is moved to the appropriate application buffer. A zero-copy implementation of MPI will require a way to strip headers and perform the MPI match at the NIC level. Again, application libraries like LINDA are implemented on top of MPI.
The same process will continue through all layers of the communication stack. We want our model to be extensible so we can capture this behavior.

3.1.2 Offload

Offloading parts or all of the processing of a protocol in order to decrease overhead has been commonplace for years. In the commodity markets of Internet serving and file serving, TCP offload engines (TOEs) are becoming more common as they attempt to compete with other networks like FibreChannel. In high-performance computing, Myrinet, VIA, Quadrics and InfiniBand all do some or all of their protocol processing on a NIC or on the network itself. Offload is an attractive way to keep overheads on the host low. Unfortunately, all of these networking solutions are either unavailable, not scalable, or very expensive. Our goal in producing this model was to provide a way to explore whether smart offloading of a commodity protocol like IP or TCP could eventually make these protocols competitive in the high-end computing arena. However, we have found this model useful in developing strategies for offloading various protocols. Offloading is the central focus of this model. Like the LAWS model, the goal of this model is to explore the benefits of offloading transport protocol processing. Unlike the LAWS model, we are doing so in the context of message-oriented high-performance applications.
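The basic tradeoff behind offload can be sketched with a toy calculation. This is not from the paper: the functions and numbers are hypothetical, and they only illustrate that moving a fraction of protocol work to a (typically slower) NIC reduces host overhead even when it does not reduce total processing time.

```python
def host_overhead_cycles(C: float, p: float) -> float:
    """Host cycles still spent on protocol processing when a fraction p
    of the C total protocol cycles is offloaded to the NIC."""
    return (1.0 - p) * C

def processing_latency(C: float, p: float, R_h: float, R_n: float) -> float:
    """Serialized protocol-processing time: the offloaded portion runs on
    the NIC at rate R_n, the remainder on the host at rate R_h. A slow
    NIC can lengthen this even as host overhead falls."""
    return p * C / R_n + (1.0 - p) * C / R_h
```

For example, fully offloading 1000 cycles to a NIC ten times slower than the host eliminates all host overhead but multiplies the processing latency by ten, which is why overhead and latency must be modeled separately.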
3.1.3 Message-Oriented

We need a performance model that is message-oriented so that we can specifically model and compare methods of offloading that decrease overhead or latency. Because we assume a low-loss network with application-level flow control, we choose to focus exclusively on the protocol behavior and the sending and receiving host. To this end, we assume the network does not affect either the overhead of a message or its latency. Clearly, the network affects latency, but we are concerned with how the host protocol processing affects latency. The message-oriented nature of the model also provides the structure necessary to model gap in a new way. The message-oriented nature of a model, along with the emphasis on the communication patterns on a single host, allows us to focus on the benefits of offloading protocol processing specifically as a measure of overhead and gap.

3.2 EMO

We wanted a performance model that is not specific to any one protocol, but our choices were informed by our understanding of MPI over TCP over IP. The Extensible Message-Oriented Offload Model (EMO) is shown in Figure 1. The latency and overhead necessary to communicate between components must include the movement of data when appropriate. The variables for this model are as follows:

C_N - # cycles of protocol processing on NIC
R_N - rate of CPU on NIC
L_NH - time to move data and control from NIC to host OS
C_H - # cycles of protocol processing on host
R_H - rate of CPU on host
L_HA - time to move data and control from host OS to application
L_NA - time to move data and control from NIC to application
C_A - # cycles of protocol processing at application
O_NH - # host cycles to move data and control from NIC to host OS
O_HA - # host cycles to move data and control from host OS to application
O_NA - # host cycles necessary to communicate and move data from NIC to application

[Figure 1: The Extensible Message-Oriented Offload Model. The NIC (CPU rate R_n, protocol overhead C_n), the host OS (CPU rate R_h, protocol overhead C_h), and the application (protocol overhead C_a) are connected by paths with latency/overhead pairs (L_nh, O_nh), (L_ha, O_ha), and (L_na, O_na).]

3.2.1 Extensibility

The model allows for extensibility with respect to protocol layers. We hope this model can be useful for researchers working on offloading parts of the MPI library (like MPI_MATCH) or parts of the matching mechanisms for any language or API. We constructed the model so that it can grow through levels of protocols. For example, our model can be extended, or telescoped, to include
offloading portions of MPI. We simply add C_M, L_AM and O_AM to the equations for overhead and latency.

3.2.2 Overhead

EMO allows us to explore the fundamental cost of any protocol, its overhead. Overhead occurs at the per-message and per-byte level. Our model allows us to estimate and graphically represent our understanding about overhead for various levels of protocol offload. Overhead is modeled as

Overhead = O_NH + O_HA + C_A + O_NA

However, any given method will use only some of the communication paths to process the protocol. Traditional overhead, for example, does not use the communication path between the NIC and the application and does no processing at the application:

Traditional Overhead = O_NH + O_HA

3.2.3 Gap

Gap is the interarrival time of messages to an application on a receive and the interdeparture time of messages from an application on a send. It is a measure of how well-pipelined the network protocol stack is. But gap is also a measure of how well-balanced the system is. If the host processor is processing packets for a receive very quickly but the NIC cannot keep up, the host processor will starve and the gap will increase. If the host processor is not able to process packets quickly enough on a receive, the NIC will starve and the gap will increase. If the network is slow, both the NIC and host will starve. Gap is a measure of how well-balanced the system is: as we minimize gap, we balance the system.

3.2.4 Latency

Latency is modeled as

Latency = L_NH + L_HA + L_NA + C_A

However, any given method will use only some of the communication paths to process the protocol. Traditional network protocols, for example, do not use the communication path between the NIC and the application and do no processing at the application:

Traditional Latency = L_NH + L_HA

3.3 Mapping EMO onto LAWS

EMO can be mapped directly onto LAWS, which is useful in order to provide a context for the model in the larger offload community.
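The EMO bookkeeping above can be sketched as a few helper functions. This is a minimal sketch: the function names, the pipeline reading of gap as the maximum stage time, and the optional C_M term in the telescoped form are our illustrative choices, not code from the paper.

```python
def overhead(O_NH, O_HA, C_A, O_NA):
    """Full EMO overhead; a given offload scheme uses only some terms."""
    return O_NH + O_HA + C_A + O_NA

def traditional_overhead(O_NH, O_HA):
    """Traditional path: no NIC-to-application communication and no
    protocol processing at the application."""
    return O_NH + O_HA

def latency(L_NH, L_HA, L_NA, C_A):
    """Full EMO latency, summing the boundary-crossing terms and the
    application's protocol processing."""
    return L_NH + L_HA + L_NA + C_A

def traditional_latency(L_NH, L_HA):
    return L_NH + L_HA

def gap(nic_time, host_time, wire_time):
    """Gap as pipeline balance: the slowest stage sets the inter-message
    time, and the other stages starve."""
    return max(nic_time, host_time, wire_time)

def mpi_overhead(O_NH, O_HA, C_A, O_NA, O_AM, C_M=0.0):
    """Telescoped one layer further, to an MPI library, by adding the
    application-to-MPI terms."""
    return overhead(O_NH, O_HA, C_A, O_NA) + O_AM + C_M
```

The telescoped form shows why extensibility is cheap in EMO: each new layer contributes one more processing term and one more boundary-crossing term to the same sums.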
Because LAWS concentrates on an arbitrary number of bytes in a specified amount of time, and EMO concentrates on an arbitrary amount of time for a specified number of bytes, we will have to make a few assumptions. The parameters that make up the ratios in the LAWS model are below.

o - CPU occupancy for communication overhead per unit of bandwidth
a - CPU occupancy for the application per unit of bandwidth
X - occupancy scale factor for host processing
Y - occupancy scale factor for NIC processing
p - portion of communication overhead o offloaded to NIC
B - bandwidth of network path

LAWS assumes a fixed amount of time. Let the fixed amount of time be equal to the time to receive a message of length N on a host. We will call this time T. This allows us to determine the total number of host cycles possible:

C_t = R_h * T

For simplicity, let us define an overhead total for EMO:

O_T = O_NH + O_HA + C_A + O_NA

Now that we have a fixed time T, a fixed number of bytes N, and the total number of host cycles C_t, we can map EMO onto the LAWS parameters. The most difficult part of the mapping from EMO to LAWS is the fact that the communication overhead o is constant while the percentage offloaded p is variable. Thus, p is a ratio used to compare two different offload schemes. Our offload schemes are modeled with different values for the EMO variables to reflect this difference. We use C_H' to represent the amount of protocol processing done on the NIC for the second offload scheme, and we assume that C_N is incremented by C_H', since this is the assumption of the LAWS model. Changes to the actual amount of protocol processing under various offload schemes are reflected in the LAWS model ratio β; in EMO, they are reflected directly in different values of the variables. The occupancy scale factors X and Y follow from the host and NIC CPU rates R_h and R_n, and the remaining parameters are:

o = O_NH + O_HA + C_A + O_NA
a = C_t - o
p = C_H' / o
B = N / T

LAWS derives all of its ratios from these parameters with the exception of β. The structural ratio describes the amount of processing saved by using a particular offload scheme. We can quantify this directly from our model, denoting the second offload mechanism by primed variables:

β = (O_NH' + O_HA' + C_A' + O_NA') / (O_NH + O_HA + C_A + O_NA)

Now we have all of the necessary elements to map EMO onto LAWS. This is useful for understanding how the EMO model fits into the larger area of offloading of protocol processing in commodity applications.
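The mapping just derived can be sketched as a single function. This assumes the reconstruction above (overhead totals summed per scheme, a = C_t - o, B = N/T, and β as the ratio of the two overhead totals); the function name and the tuple convention are ours.

```python
def map_emo_to_laws(N, T, R_h, scheme1, scheme2):
    """Map EMO quantities onto LAWS parameters for a message of N bytes
    received in time T on a host with CPU rate R_h. scheme1 and scheme2
    are (O_NH, O_HA, C_A, O_NA) overhead terms for two offload schemes;
    scheme2 plays the role of the primed scheme in the text."""
    C_t = R_h * T            # total host cycles available in time T
    O_T = sum(scheme1)       # EMO overhead total for the first scheme
    O_T2 = sum(scheme2)      # same total for the second scheme
    return {
        "o": O_T,            # communication occupancy
        "a": C_t - O_T,      # cycles left over for the application
        "B": N / T,          # bandwidth of the network path
        "beta": O_T2 / O_T,  # structural ratio between the two schemes
    }
```

For instance, a second scheme that halves the overhead total yields β = 0.5, the LAWS regime in which offload eliminates half the host's protocol work.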
We created EMO for high-end computing so we can explore gap and overhead for message-oriented applications.

4 Model Verification - Initial Results

We created EMO as a language for comparing methods of offload, but it can be considered an analytical model as well. Our initial verification concerned interrupt coalescing. We verified our model for the latency of UDP with no interrupt coalescing and with default interrupt coalescing. We measured latencies by creating a ping-pong test between Host A and Host B. Host A remains constant throughout the measurements. Host A is a 933 MHz Pentium III running an unmodified Linux kernel with the Acenic Gigabit Ethernet card set to default values for interrupt coalescing and transmit ratios. Host B is the same
machine connected to Host A by cross-over fiber. Host B also runs an unmodified version of the Linux kernel. We measured overhead by modifying our ping-pong test. Host A continues the ping-pong test, but Host B includes a cycle-soaker that counts the number of cycles that can be completed while communication is in progress.

4.1 Latency

In order to verify the model for latency, we measured actual latency and approximated measurements for the various parts of the sum for our equation:

Traditional Latency = L_NH + L_HA

Our model is verified to the extent that the sum of the addends approximates the actual measured latency.

4.1.1 Application to Application Latency

In order to measure the traditional latency, we ran a simple UDP echo server in user space on Host B. Host A simply measures ping-pong latency for various size messages. We measured this latency from 1 byte messages through 89 byte messages. We wanted to remain within the jumbo frame size to avoid fragmentation and reassembly or multiple packets, but we wanted to exercise the crossing of page boundaries. The page size for the Linux kernel is 4 KB. We measured application to application latency when Host B had default interrupt coalescing parameters set, and also when Host B had interrupt coalescing turned off.

4.1.2 Application to NIC Latency

In order to measure application to NIC latency, we moved the UDP echo server into the Acenic firmware. This allows us to measure the latency of a message as it travels through Host A, across the wire, and to the NIC on Host B. This latency should not reflect the cost of the interrupt on Host B, the cost of moving through the kernel receive or send paths on Host B, nor the cost of the copy of data into user space on Host B. The UDP echo server exercises all of the code in the traditional UDP receive and send paths in the Acenic firmware with the exception of the DMA to the host, so we assume that the application to NIC latency measurement includes the on-NIC portion of the latency calculation.
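A user-space analogue of this ping-pong methodology can be sketched as follows. This is an illustrative sketch over the loopback interface, not the paper's test harness: it measures the full traditional path (kernel and copy included) on a single machine, whereas the paper's variants terminate the echo at the kernel or the NIC firmware.

```python
import socket
import threading
import time

def udp_echo_server(sock):
    """Echo datagrams back to their sender until a b'stop' sentinel."""
    while True:
        data, addr = sock.recvfrom(65535)
        if data == b"stop":
            return
        sock.sendto(data, addr)

def pingpong_latency(msg_size=1024, iters=100):
    """Estimate one-way latency as half the average round-trip time of
    iters ping-pong exchanges of msg_size-byte UDP datagrams."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))
    t = threading.Thread(target=udp_echo_server, args=(server,))
    t.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(5.0)
    addr = server.getsockname()
    payload = b"x" * msg_size

    start = time.perf_counter()
    for _ in range(iters):
        client.sendto(payload, addr)
        client.recvfrom(65535)
    elapsed = time.perf_counter() - start

    client.sendto(b"stop", addr)
    t.join()
    server.close()
    client.close()
    return elapsed / (2 * iters)
```

Sweeping msg_size reproduces the shape of the paper's experiment: the per-byte copy and checksum costs appear as the slope of latency versus message size.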
However, it is important to note that the startup time for the DMA engine on the Acenic cards is approximately 5 µs. This will be accounted for in the L_NH portion of the calculation.

4.1.3 Application to Kernel Latency

For our initial results, we chose to measure the latency between an application on Host A and the kernel on Host B using the ping utility already provided by the kernel. This was an easy measurement to procure and should give a reasonable approximation of the latency between the application and the kernel. While the ping message does not travel the exact code path of a UDP message in the kernel, it does exercise the same IP path and very similar code at the level above IP. The ICMP or ping message does not perform a copy of data to user space and does not perform a route lookup. Unfortunately for our purposes, the ping utility does perform a checksum, so the application to kernel measurements will be linearly dependent on the size of the message. This is a major difference. In the future, we would like to add the echo server into the kernel so as to be as consistent as possible in our calculations. The application to kernel latency was measured with an unmodified Host A with default interrupt coalescing, and with Host B with both default interrupt coalescing and no interrupt coalescing. We expect that the differences between interrupt coalescing and no interrupt coalescing should be present at this level.
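The linear size dependence comes from the checksum touching every byte of the message. A standard RFC 1071-style Internet checksum, shown here as an illustrative Python sketch rather than kernel code, makes the per-byte cost visible:

```python
def inet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: sum 16-bit words with end-around
    carry, then take the one's complement. Every byte of the message is
    read, so the cost grows linearly with message size."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF
```

The loop runs once per 16-bit word, which is why a checksumming path shows a slope proportional to message size while a path that never touches the payload does not.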
4.1.4 Results

The expectation is that the average latency will generally be smaller when there is no interrupt coalescing, and this is what our model shows. Interrupt coalescing can be seen as a move to decrease the overhead effect of the interrupt, O_NH, at the cost of the time before an interrupt reaches the host. This means we expect that L_NH is the variable affected. We cannot currently fully verify this result: we have no reasonable way to isolate the processing on the NIC, and we have not fully measured processing time or overhead in the kernel. However, Figure 2 shows that latency is generally slightly lower when Host B disables interrupt coalescing.

[Figure 2: Application to application latency (average latency in µs) with default interrupt coalescing and with no interrupt coalescing.]

If we let X be the latency of all communication on Host A, the wire, and the NIC on Host B, then our actual application to application latency can be represented as

X + L_NH + L_HA

Figure 3 further isolates the phenomenon by approximating the ping-pong measurement without the final move from kernel to application space; it represents a comparison between the latency without interrupt coalescing, X + L_NH, and the latency with interrupt coalescing, X + L_NH'.

[Figure 3: Application to kernel latency (average latency in µs) with default interrupt coalescing and with no interrupt coalescing.]

Our original intent was to use the application to NIC measurements to isolate X. However, Figure 4 shows clearly that our application to NIC measurements are not useful. Because the slope is greater than the slope of the actual latency, it appears that our UDP echo server on the NIC is touching the data more often than the traditional application echo server. At the very least, the UDP echo server is clearly touching the data, and the speed of the NIC (88 MHz) is exacerbating the slowdown. We expect this to be a temporary problem, as there is no design issue that precludes returning the data without touching it. This is simply an implementation flaw.

[Figure 4: Latency with no interrupt coalescing - actual, application to kernel, and application to NIC measurements (average latency in µs).]

Figure 4 also presents us with another artifact
dissimilar from the model. From our preliminary discussions of EMO, we assumed that L_HA is linear with the size of the message because the copy of data occurs there. However, the application to kernel route is clearly also touching all of the data. This is presumably because the kernel is checksumming the IP packets and therefore touching the data as well. Clearly, the UDP echo server on the NIC must be rendered useful, and the application to kernel path cannot include the checksum, in order for us to fully verify our model. However, the insights provided by the model for interrupt coalescing are verified by the measurements.

4.2 Overhead

In order to verify the model for overhead, we measured actual overhead and approximated measurements for the various parts of the sum for our equation:

Traditional Overhead = O_NH + O_HA

Our model is verified to the extent that the sum of the addends approximates the actual measured overhead.

4.2.1 Application to Application Overhead

We measured the amount of cycle-soak work Host B can do without any communication occurring. Then we measured the amount of cycle-soak work Host B can do with standard ping-pong communication of various sized messages occurring between an application on Host A and a UDP echo server on Host B. The difference between these two amounts of work is the overhead associated with the communication: it is the number of cycles being taken away from calculation. We measured the overhead of application to application communication with default interrupts on Host A, and with both default interrupts and no interrupts on Host B. We expect that the overhead of application to application communication when Host B is using interrupt coalescing will be lower than when Host B is not using interrupt coalescing.

4.2.2 Kernel to Application Overhead

In order to measure the overhead for kernel to application communication, Host A ran a ping flood on Host B and Host B ran the cycle-soak work calculation.
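The cycle-soaker methodology can be sketched as follows. This is an illustrative user-space analogue, not the paper's instrument: "work" here is a trivial counter loop rather than the paper's calculation, and the unit is loop iterations rather than cycles.

```python
import time

def soak(seconds: float) -> int:
    """Count units of trivial work completed in a fixed wall-clock
    window; anything that steals the CPU lowers the count."""
    n = 0
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        n += 1
    return n

def communication_overhead(baseline: int, with_comm: int) -> int:
    """Work stolen from computation by communication: the difference
    between the soak count with no traffic and the count measured while
    the ping-pong (or ping flood) test runs."""
    return baseline - with_comm
```

Running soak() twice over equal windows, once idle and once under load, and differencing the counts is exactly the subtraction the text describes.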
We expect that interrupt coalescing will still make a difference at this level of communication, so that Host B with no interrupt coalescing will have higher overhead than Host B with default interrupt coalescing. However, we do not expect the size of the message to make as much of a difference in the communication overhead at this level as it does at the application to application level.

4.2.3 NIC to Application Overhead

In order to measure the overhead for NIC to application communication, Host B is run with the modified Acenic firmware with the UDP echo server at the NIC level. Host A runs the UDP ping-pong test and Host B runs the cycle-soak work calculation. We expect quite low overhead on Host B, as there is no host involvement with the communication and therefore no communication overhead.

4.2.4 Results

As expected, there was no communication overhead when the UDP echo server runs at the NIC level. This verified our modeled expectations. We expected that the overhead for application to application communication would be lower when Host B employed interrupt coalescing. Figure 5 shows that the difference is negligible at best, but this reflects general results regarding interrupt coalescing and its efficacy in lowering overhead [1].
Moreover, Figure 6 shows that this effect also follows through the kernel to application communication path, as expected, since the interrupt is still present in this path.

[Figure 5: Application to application overheads (average overhead in cycles) with default interrupts and no interrupts.]

[Figure 6: Kernel to application overheads (average overhead in cycles) with default interrupts and no interrupts.]

The most interesting result, however, does not verify our assumptions regarding the EMO model. The overhead for application to application communication does not increase with the size of the message as we expected. Figure 7 shows that the gap that represents O_HA remains constant rather than increasing with the size of the message, and the overall overhead also remains constant. Clearly, for small messages, the overhead is predominantly the interrupt overhead. Measurements for much larger messages should reveal overheads that begin to slope with the size of the message. Those results should bring a clearer understanding of the role of the memory subsystem in EMO.

[Figure 7: Overhead with default interrupt coalescing - actual and application to kernel measurements (average overhead in cycles).]

5 Conclusions and Future Work

The extensible message-oriented offload model (EMO) allows us to explore the space of network protocol implementation from the application messaging layers through to the NIC on a message-by-message basis. This new language gives us a fresh understanding of the role of offloading in terms of overhead, latency and gap in high-performance systems. The preliminary work on verification has yielded areas of further research. A more effective means of isolating the various parts of the message path must be created before verification can be complete. The next steps are a profiling of the kernel and a reimplementation of the UDP echo server at the NIC and at the kernel.
Clearly, the model remains conceptual until all variables can be isolated and all assumptions can be verified. EMO as a language for exploring offload design is already being used. We plan on exploring the model of gap in EMO to bound the resource requirements for NICs or TCP offload engines at 1 Gb/s speeds. We also plan to extend EMO to include memory management considerations such as caching.

References

[1] P. Gilfeather and T. Underwood. Fragmentation and high performance IP. In Proc. of the 15th International Parallel and Distributed Processing Symposium, April 2001.

[2] P. Shivam and J. Chase. On the elusive benefits of protocol offload. In Proc. of the SIGCOMM Workshop on Network-I/O Convergence: Experience, Lessons, Implications (NICELI), August 2003.
More informationImproving the Database Logging Performance of the Snort Network Intrusion Detection Sensor
-0- Improving the Database Logging Performance of the Snort Network Intrusion Detection Sensor Lambert Schaelicke, Matthew R. Geiger, Curt J. Freeland Department of Computer Science and Engineering University
More informationInitial Evaluation of a User-Level Device Driver Framework
Initial Evaluation of a User-Level Device Driver Framework Stefan Götz Karlsruhe University Germany sgoetz@ira.uka.de Kevin Elphinstone National ICT Australia University of New South Wales kevine@cse.unsw.edu.au
More information10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G
10-Gigabit iwarp Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G Mohammad J. Rashti and Ahmad Afsahi Queen s University Kingston, ON, Canada 2007 Workshop on Communication Architectures
More informationPerformance Evaluation of Myrinet-based Network Router
Performance Evaluation of Myrinet-based Network Router Information and Communications University 2001. 1. 16 Chansu Yu, Younghee Lee, Ben Lee Contents Suez : Cluster-based Router Suez Implementation Implementation
More informationLighting the Blue Touchpaper for UK e-science - Closing Conference of ESLEA Project The George Hotel, Edinburgh, UK March, 2007
Working with 1 Gigabit Ethernet 1, The School of Physics and Astronomy, The University of Manchester, Manchester, M13 9PL UK E-mail: R.Hughes-Jones@manchester.ac.uk Stephen Kershaw The School of Physics
More informationLANCOM Techpaper IEEE n Indoor Performance
Introduction The standard IEEE 802.11n features a number of new mechanisms which significantly increase available bandwidths. The former wireless LAN standards based on 802.11a/g enable physical gross
More informationMiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces
MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces Hye-Churn Jang Hyun-Wook (Jin) Jin Department of Computer Science and Engineering Konkuk University Seoul, Korea {comfact,
More information19: Networking. Networking Hardware. Mark Handley
19: Networking Mark Handley Networking Hardware Lots of different hardware: Modem byte at a time, FDDI, SONET packet at a time ATM (including some DSL) 53-byte cell at a time Reality is that most networking
More informationDistributing Application and OS Functionality to Improve Application Performance
Distributing Application and OS Functionality to Improve Application Performance Arthur B. Maccabe, William Lawry, Christopher Wilson, Rolf Riesen April 2002 Abstract In this paper we demonstrate that
More informationThe Future of High-Performance Networking (The 5?, 10?, 15? Year Outlook)
Workshop on New Visions for Large-Scale Networks: Research & Applications Vienna, VA, USA, March 12-14, 2001 The Future of High-Performance Networking (The 5?, 10?, 15? Year Outlook) Wu-chun Feng feng@lanl.gov
More informationNetwork Design Considerations for Grid Computing
Network Design Considerations for Grid Computing Engineering Systems How Bandwidth, Latency, and Packet Size Impact Grid Job Performance by Erik Burrows, Engineering Systems Analyst, Principal, Broadcom
More informationLinux Network Tuning Guide for AMD EPYC Processor Based Servers
Linux Network Tuning Guide for AMD EPYC Processor Application Note Publication # 56224 Revision: 1.00 Issue Date: November 2017 Advanced Micro Devices 2017 Advanced Micro Devices, Inc. All rights reserved.
More information440GX Application Note
Overview of TCP/IP Acceleration Hardware January 22, 2008 Introduction Modern interconnect technology offers Gigabit/second (Gb/s) speed that has shifted the bottleneck in communication from the physical
More informationSolace Message Routers and Cisco Ethernet Switches: Unified Infrastructure for Financial Services Middleware
Solace Message Routers and Cisco Ethernet Switches: Unified Infrastructure for Financial Services Middleware What You Will Learn The goal of zero latency in financial services has caused the creation of
More informationBrent Callaghan Sun Microsystems, Inc. Sun Microsystems, Inc
Brent Callaghan. brent@eng.sun.com Page 1 of 19 A Problem: Data Center Performance CPU 1 Gb Fibre Channel 100 MB/sec Storage Array CPU NFS 1 Gb Ethernet 50 MB/sec (via Gigaswift) NFS Server Page 2 of 19
More informationLAPI on HPS Evaluating Federation
LAPI on HPS Evaluating Federation Adrian Jackson August 23, 2004 Abstract LAPI is an IBM-specific communication library that performs single-sided operation. This library was well profiled on Phase 1 of
More informationRDMA over Commodity Ethernet at Scale
RDMA over Commodity Ethernet at Scale Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitendra Padhye, Marina Lipshteyn ACM SIGCOMM 2016 August 24 2016 Outline RDMA/RoCEv2 background DSCP-based
More informationLessons learned from MPI
Lessons learned from MPI Patrick Geoffray Opinionated Senior Software Architect patrick@myri.com 1 GM design Written by hardware people, pre-date MPI. 2-sided and 1-sided operations: All asynchronous.
More informationReduces latency and buffer overhead. Messaging occurs at a speed close to the processors being directly connected. Less error detection
Switching Operational modes: Store-and-forward: Each switch receives an entire packet before it forwards it onto the next switch - useful in a general purpose network (I.e. a LAN). usually, there is a
More informationThe Case for RDMA. Jim Pinkerton RDMA Consortium 5/29/2002
The Case for RDMA Jim Pinkerton RDMA Consortium 5/29/2002 Agenda What is the problem? CPU utilization and memory BW bottlenecks Offload technology has failed (many times) RDMA is a proven sol n to the
More informationUse of the Internet SCSI (iscsi) protocol
A unified networking approach to iscsi storage with Broadcom controllers By Dhiraj Sehgal, Abhijit Aswath, and Srinivas Thodati In environments based on Internet SCSI (iscsi) and 10 Gigabit Ethernet, deploying
More informationIntroduction to Ethernet Latency
Introduction to Ethernet Latency An Explanation of Latency and Latency Measurement The primary difference in the various methods of latency measurement is the point in the software stack at which the latency
More information6.9. Communicating to the Outside World: Cluster Networking
6.9 Communicating to the Outside World: Cluster Networking This online section describes the networking hardware and software used to connect the nodes of cluster together. As there are whole books and
More informationOptimizing TCP Receive Performance
Optimizing TCP Receive Performance Aravind Menon and Willy Zwaenepoel School of Computer and Communication Sciences EPFL Abstract The performance of receive side TCP processing has traditionally been dominated
More informationHIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS
HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS CS6410 Moontae Lee (Nov 20, 2014) Part 1 Overview 00 Background User-level Networking (U-Net) Remote Direct Memory Access
More informationMyri-10G Myrinet Converges with Ethernet
Myri-10G Myrinet Converges with Ethernet David PeGan VP, Sales dave@myri.com (Substituting for Tom Leinberger) 4 October 2006 Oklahoma Supercomputing Symposium 1 New Directions for Myricom Although Myricom
More informationVirtualization, Xen and Denali
Virtualization, Xen and Denali Susmit Shannigrahi November 9, 2011 Susmit Shannigrahi () Virtualization, Xen and Denali November 9, 2011 1 / 70 Introduction Virtualization is the technology to allow two
More informationMultimedia Streaming. Mike Zink
Multimedia Streaming Mike Zink Technical Challenges Servers (and proxy caches) storage continuous media streams, e.g.: 4000 movies * 90 minutes * 10 Mbps (DVD) = 27.0 TB 15 Mbps = 40.5 TB 36 Mbps (BluRay)=
More informationKey Measures of InfiniBand Performance in the Data Center. Driving Metrics for End User Benefits
Key Measures of InfiniBand Performance in the Data Center Driving Metrics for End User Benefits Benchmark Subgroup Benchmark Subgroup Charter The InfiniBand Benchmarking Subgroup has been chartered by
More informationStorage Protocol Offload for Virtualized Environments Session 301-F
Storage Protocol Offload for Virtualized Environments Session 301-F Dennis Martin, President August 2016 1 Agenda About Demartek Offloads I/O Virtualization Concepts RDMA Concepts Overlay Networks and
More informationNetwork Test and Monitoring Tools
ajgillette.com Technical Note Network Test and Monitoring Tools Author: A.J.Gillette Date: December 6, 2012 Revision: 1.3 Table of Contents Network Test and Monitoring Tools...1 Introduction...3 Link Characterization...4
More informationRiceNIC. Prototyping Network Interfaces. Jeffrey Shafer Scott Rixner
RiceNIC Prototyping Network Interfaces Jeffrey Shafer Scott Rixner RiceNIC Overview Gigabit Ethernet Network Interface Card RiceNIC - Prototyping Network Interfaces 2 RiceNIC Overview Reconfigurable and
More informationApplication Acceleration Beyond Flash Storage
Application Acceleration Beyond Flash Storage Session 303C Mellanox Technologies Flash Memory Summit July 2014 Accelerating Applications, Step-by-Step First Steps Make compute fast Moore s Law Make storage
More informationRoCE vs. iwarp Competitive Analysis
WHITE PAPER February 217 RoCE vs. iwarp Competitive Analysis Executive Summary...1 RoCE s Advantages over iwarp...1 Performance and Benchmark Examples...3 Best Performance for Virtualization...5 Summary...6
More informationQuickSpecs. Overview. HPE Ethernet 10Gb 2-port 535 Adapter. HPE Ethernet 10Gb 2-port 535 Adapter. 1. Product description. 2.
Overview 1. Product description 2. Product features 1. Product description HPE Ethernet 10Gb 2-port 535FLR-T adapter 1 HPE Ethernet 10Gb 2-port 535T adapter The HPE Ethernet 10GBase-T 2-port 535 adapters
More informationIntel PRO/1000 PT and PF Quad Port Bypass Server Adapters for In-line Server Appliances
Technology Brief Intel PRO/1000 PT and PF Quad Port Bypass Server Adapters for In-line Server Appliances Intel PRO/1000 PT and PF Quad Port Bypass Server Adapters for In-line Server Appliances The world
More informationRiceNIC. A Reconfigurable Network Interface for Experimental Research and Education. Jeffrey Shafer Scott Rixner
RiceNIC A Reconfigurable Network Interface for Experimental Research and Education Jeffrey Shafer Scott Rixner Introduction Networking is critical to modern computer systems Role of the network interface
More informationNFS/RDMA over 40Gbps iwarp Wael Noureddine Chelsio Communications
NFS/RDMA over 40Gbps iwarp Wael Noureddine Chelsio Communications Outline RDMA Motivating trends iwarp NFS over RDMA Overview Chelsio T5 support Performance results 2 Adoption Rate of 40GbE Source: Crehan
More informationAn FPGA-Based Optical IOH Architecture for Embedded System
An FPGA-Based Optical IOH Architecture for Embedded System Saravana.S Assistant Professor, Bharath University, Chennai 600073, India Abstract Data traffic has tremendously increased and is still increasing
More informationMemory Management. q Basic memory management q Swapping q Kernel memory allocation q Next Time: Virtual memory
Memory Management q Basic memory management q Swapping q Kernel memory allocation q Next Time: Virtual memory Memory management Ideal memory for a programmer large, fast, nonvolatile and cheap not an option
More informationImplementation and Analysis of Large Receive Offload in a Virtualized System
Implementation and Analysis of Large Receive Offload in a Virtualized System Takayuki Hatori and Hitoshi Oi The University of Aizu, Aizu Wakamatsu, JAPAN {s1110173,hitoshi}@u-aizu.ac.jp Abstract System
More informationComparing Server I/O Consolidation Solutions: iscsi, InfiniBand and FCoE. Gilles Chekroun Errol Roberts
Comparing Server I/O Consolidation Solutions: iscsi, InfiniBand and FCoE Gilles Chekroun Errol Roberts SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA. Member companies
More informationHP Cluster Interconnects: The Next 5 Years
HP Cluster Interconnects: The Next 5 Years Michael Krause mkrause@hp.com September 8, 2003 2003 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice
More informationMessage Passing Architecture in Intra-Cluster Communication
CS213 Message Passing Architecture in Intra-Cluster Communication Xiao Zhang Lamxi Bhuyan @cs.ucr.edu February 8, 2004 UC Riverside Slide 1 CS213 Outline 1 Kernel-based Message Passing
More informationLarge Receive Offload implementation in Neterion 10GbE Ethernet driver
Large Receive Offload implementation in Neterion 10GbE Ethernet driver Leonid Grossman Neterion, Inc. leonid@neterion.com Abstract 1 Introduction The benefits of TSO (Transmit Side Offload) implementation
More information打造 Linux 下的高性能网络 北京酷锐达信息技术有限公司技术总监史应生.
打造 Linux 下的高性能网络 北京酷锐达信息技术有限公司技术总监史应生 shiys@solutionware.com.cn BY DEFAULT, LINUX NETWORKING NOT TUNED FOR MAX PERFORMANCE, MORE FOR RELIABILITY Trade-off :Low Latency, throughput, determinism Performance
More informationCERN openlab Summer 2006: Networking Overview
CERN openlab Summer 2006: Networking Overview Martin Swany, Ph.D. Assistant Professor, Computer and Information Sciences, U. Delaware, USA Visiting Helsinki Institute of Physics (HIP) at CERN swany@cis.udel.edu,
More informationImproving Cluster Performance
Improving Cluster Performance Service Offloading Larger clusters may need to have special purpose node(s) to run services to prevent slowdown due to contention (e.g. NFS, DNS, login, compilation) In cluster
More informationInitial Performance Evaluation of the Cray SeaStar Interconnect
Initial Performance Evaluation of the Cray SeaStar Interconnect Ron Brightwell Kevin Pedretti Keith Underwood Sandia National Laboratories Scalable Computing Systems Department 13 th IEEE Symposium on
More informationNetworking interview questions
Networking interview questions What is LAN? LAN is a computer network that spans a relatively small area. Most LANs are confined to a single building or group of buildings. However, one LAN can be connected
More informationWhy Your Application only Uses 10Mbps Even the Link is 1Gbps?
Why Your Application only Uses 10Mbps Even the Link is 1Gbps? Contents Introduction Background Information Overview of the Issue Bandwidth-Delay Product Verify Solution How to Tell Round Trip Time (RTT)
More informationCommunication Networks ( ) / Fall 2013 The Blavatnik School of Computer Science, Tel-Aviv University. Allon Wagner
Communication Networks (0368-3030) / Fall 2013 The Blavatnik School of Computer Science, Tel-Aviv University Allon Wagner Kurose & Ross, Chapter 4 (5 th ed.) Many slides adapted from: J. Kurose & K. Ross
More informationMulticomputer distributed system LECTURE 8
Multicomputer distributed system LECTURE 8 DR. SAMMAN H. AMEEN 1 Wide area network (WAN); A WAN connects a large number of computers that are spread over large geographic distances. It can span sites in
More informationINT 1011 TCP Offload Engine (Full Offload)
INT 1011 TCP Offload Engine (Full Offload) Product brief, features and benefits summary Provides lowest Latency and highest bandwidth. Highly customizable hardware IP block. Easily portable to ASIC flow,
More informationOperating Systems 2010/2011
Operating Systems 2010/2011 Input/Output Systems part 2 (ch13, ch12) Shudong Chen 1 Recap Discuss the principles of I/O hardware and its complexity Explore the structure of an operating system s I/O subsystem
More informationTopic & Scope. Content: The course gives
Topic & Scope Content: The course gives an overview of network processor cards (architectures and use) an introduction of how to program Intel IXP network processors some ideas of how to use network processors
More informationUtilizing Linux Kernel Components in K42 K42 Team modified October 2001
K42 Team modified October 2001 This paper discusses how K42 uses Linux-kernel components to support a wide range of hardware, a full-featured TCP/IP stack and Linux file-systems. An examination of the
More informationMaximum Performance. How to get it and how to avoid pitfalls. Christoph Lameter, PhD
Maximum Performance How to get it and how to avoid pitfalls Christoph Lameter, PhD cl@linux.com Performance Just push a button? Systems are optimized by default for good general performance in all areas.
More informationTroubleshooting High CPU Caused by the BGP Scanner or BGP Router Process
Troubleshooting High CPU Caused by the BGP Scanner or BGP Router Process Document ID: 107615 Contents Introduction Before You Begin Conventions Prerequisites Components Used Understanding BGP Processes
More informationProfiling the Performance of TCP/IP on Windows NT
Profiling the Performance of TCP/IP on Windows NT P.Xie, B. Wu, M. Liu, Jim Harris, Chris Scheiman Abstract This paper presents detailed network performance measurements of a prototype implementation of
More informationAccelerating Web Protocols Using RDMA
Accelerating Web Protocols Using RDMA Dennis Dalessandro Ohio Supercomputer Center NCA 2007 Who's Responsible for this? Dennis Dalessandro Ohio Supercomputer Center - Springfield dennis@osc.edu Pete Wyckoff
More informationiscsi Technology: A Convergence of Networking and Storage
HP Industry Standard Servers April 2003 iscsi Technology: A Convergence of Networking and Storage technology brief TC030402TB Table of Contents Abstract... 2 Introduction... 2 The Changing Storage Environment...
More informationConnection Handoff Policies for TCP Offload Network Interfaces
Connection Handoff Policies for TCP Offload Network Interfaces Hyong-youb Kim and Scott Rixner Rice University Houston, TX 77005 {hykim, rixner}@rice.edu Abstract This paper presents three policies for
More informationIX: A Protected Dataplane Operating System for High Throughput and Low Latency
IX: A Protected Dataplane Operating System for High Throughput and Low Latency Belay, A. et al. Proc. of the 11th USENIX Symp. on OSDI, pp. 49-65, 2014. Reviewed by Chun-Yu and Xinghao Li Summary In this
More informationRICE UNIVERSITY. High Performance MPI Libraries for Ethernet. Supratik Majumder
RICE UNIVERSITY High Performance MPI Libraries for Ethernet by Supratik Majumder A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree MASTER OF SCIENCE Approved, Thesis Committee:
More informationInfiniBand Networked Flash Storage
InfiniBand Networked Flash Storage Superior Performance, Efficiency and Scalability Motti Beck Director Enterprise Market Development, Mellanox Technologies Flash Memory Summit 2016 Santa Clara, CA 1 17PB
More informationRDMA and Hardware Support
RDMA and Hardware Support SIGCOMM Topic Preview 2018 Yibo Zhu Microsoft Research 1 The (Traditional) Journey of Data How app developers see the network Under the hood This architecture had been working
More informationUsing Switches with a PS Series Group
Cisco Catalyst 3750 and 2970 Switches Using Switches with a PS Series Group Abstract This Technical Report describes how to use Cisco Catalyst 3750 and 2970 switches with a PS Series group to create a
More informationDeploy a High-Performance Database Solution: Cisco UCS B420 M4 Blade Server with Fusion iomemory PX600 Using Oracle Database 12c
White Paper Deploy a High-Performance Database Solution: Cisco UCS B420 M4 Blade Server with Fusion iomemory PX600 Using Oracle Database 12c What You Will Learn This document demonstrates the benefits
More informationIntroduction to Computer Networks. CS 166: Introduction to Computer Systems Security
Introduction to Computer Networks CS 166: Introduction to Computer Systems Security Network Communication Communication in modern networks is characterized by the following fundamental principles Packet
More informationProfiling Grid Data Transfer Protocols and Servers. George Kola, Tevfik Kosar and Miron Livny University of Wisconsin-Madison USA
Profiling Grid Data Transfer Protocols and Servers George Kola, Tevfik Kosar and Miron Livny University of Wisconsin-Madison USA Motivation Scientific experiments are generating large amounts of data Education
More informationThe Convergence of Storage and Server Virtualization Solarflare Communications, Inc.
The Convergence of Storage and Server Virtualization 2007 Solarflare Communications, Inc. About Solarflare Communications Privately-held, fabless semiconductor company. Founded 2001 Top tier investors:
More informationTitan: Fair Packet Scheduling for Commodity Multiqueue NICs. Brent Stephens, Arjun Singhvi, Aditya Akella, and Mike Swift July 13 th, 2017
Titan: Fair Packet Scheduling for Commodity Multiqueue NICs Brent Stephens, Arjun Singhvi, Aditya Akella, and Mike Swift July 13 th, 2017 Ethernet line-rates are increasing! 2 Servers need: To drive increasing
More informationThe Interconnection Structure of. The Internet. EECC694 - Shaaban
The Internet Evolved from the ARPANET (the Advanced Research Projects Agency Network), a project funded by The U.S. Department of Defense (DOD) in 1969. ARPANET's purpose was to provide the U.S. Defense
More informationScribe Notes -- October 31st, 2017
Scribe Notes -- October 31st, 2017 TCP/IP Protocol Suite Most popular protocol but was designed with fault tolerance in mind, not security. Consequences of this: People realized that errors in transmission
More informationIdentifying the Sources of Latency in a Splintered Protocol
Identifying the Sources of Latency in a Splintered Protocol Wenbin Zhu, Arthur B. Maccabe Computer Science Department The University of New Mexico Albuquerque, NM 87131 Rolf Riesen Scalable Computing Systems
More informationLecture 3. The Network Layer (cont d) Network Layer 1-1
Lecture 3 The Network Layer (cont d) Network Layer 1-1 Agenda The Network Layer (cont d) What is inside a router? Internet Protocol (IP) IPv4 fragmentation and addressing IP Address Classes and Subnets
More informationPerformance Optimisations for HPC workloads. August 2008 Imed Chihi
Performance Optimisations for HPC workloads August 2008 Imed Chihi Agenda The computing model The assignment problem CPU sets Priorities Disk IO optimisations gettimeofday() Disabling services Memory management
More information