Performance Characterization of TCP/IP Packet Processing in Commercial Server Workloads


Performance Characterization of TCP/IP Packet Processing in Commercial Server Workloads

Srihari Makineni and Ravi Iyer
Network Architecture Laboratory, Intel Corporation
{srihari.makineni, ravishankar.iyer}@intel.com

Abstract -- TCP/IP is the communication protocol of choice for many current and next generation server applications (web services, e-commerce, storage, etc.). As a result, the performance of these applications can be heavily dependent on efficient TCP/IP packet processing within the termination nodes. Motivated by this, our work presented in this paper focuses on analyzing the underlying architectural characteristics of TCP/IP packet processing and quantifying the computational requirements of the TCP/IP packet processing component within current server workloads. Our analysis and characterization methodology is based on in-depth measurement experiments of TCP/IP packet processing performance on Intel's state-of-the-art low-power Pentium M microprocessor running the Microsoft Windows Server 2003 operating system. We start by analyzing the impact of NIC features such as Large Segment Offload and the use of jumbo frames on TCP/IP packet processing performance. We then show that the architectural characteristics of transmit-side processing (largely compute-bound) are significantly different from receive-side processing (mostly memory-bound). Finally, we quantify the computational requirements for sending/receiving packets within commercial workloads (SPECweb99, TPC-C and TPC-W) and show that they can form a substantial component.

I. INTRODUCTION

As Internet presence and connectivity continue to increase at a rapid pace, there is a growing need to increase the available end-to-end communication bandwidth and performance. TCP/IP [19] over Ethernet is the most prevalent form of communication that exists today in various environments. Commercial web and e-commerce server benchmarks such as SPECweb99 and TPC-W illustrate the amount of communication that is needed between the clients and servers as well as among the multiple tiers of servers involved in processing the transaction. In addition, back-end database workloads like TPC-C exhibit significant storage I/O that translates into network traffic when using IP-based storage servers. One of the well-recognized issues with TCP/IP communication is the significant computational requirement of protocol stack processing at the termination nodes (clients and servers). However, there are very few architectural studies in the literature [3, 4, 6] that present details on the performance characteristics and computational requirements of TCP/IP protocol processing. Moreover, there seems to be very little information relating packet processing to such commercial server workloads. In related work, several research projects contemplate the potential of using a TCP/IP offload engine (TOE) or a packet processing engine (PPE) [1, 2, 5, 9, 14, 20, 21] to offload protocol stack processing from the host CPU. The relevance of packet processing [17, 18, 24] grows stronger as Storage over IP becomes popular with the help of working groups for iSCSI [11], RDMA [22] and DDP [23]. While these efforts provide valuable information, the study of micro-architectural characteristics and bottlenecks in stack processing is critical to help assist the architectural considerations and design choices for a PPE.
Another direction of research that is equally important is the need to identify architectural improvements to general purpose cores that could also help improve the efficiency and performance of packet processing. The overall contributions of this paper are as follows. We perform an in-depth measurement and evaluation of packet processing on the state-of-the-art low power Intel Pentium M microprocessor [7] running the Microsoft Windows Server 2003 Enterprise Edition operating system. We chose the Pentium M processor as it is gaining momentum in the low power server blade segment [29]. The primary focus of this study is to do the following: (a) to expose the performance and architectural characteristics of packet processing and (b) to quantify the computational requirements of the TCP/IP packet processing component within server workloads. In this paper, we start by describing the measurement-based methodology used to achieve these goals. We then present an in-depth analysis of the performance data (CPI, path length, cache misses, branch prediction accuracy, etc.) collected through application-level measurements and the use of in-built performance counters available on the Pentium M microprocessor. Finally, we compare the requirements of packet processing to the overall characteristics of commercial server workloads (SPECweb99, TPC-C and TPC-W) and discuss the potential opportunities for optimizing packet processing performance.

The rest of this paper is organized as follows. Section II discusses the server networking requirements and presents an overview of TCP/IP packet processing components. Section III describes the detailed measurement-based methodology that we employed. Section IV presents an overview of TCP/IP packet processing performance and delves into the architectural characteristics of both transmit-side and receive-side processing. Section V relates the packet processing requirements to commercial workload execution. Section VI summarizes our inferences and concludes with a direction towards future work in this area.

II. TCP/IP PACKET PROCESSING IN SERVER WORKLOADS

In this section, we start by discussing the networking requirements for server workloads to motivate the need to characterize TCP/IP packet processing performance. We then discuss the components of interest when evaluating TCP/IP packet processing.

A. Server Networking Requirements

We have chosen three commercial server workloads (SPECweb99 [25], TPC-W [28] and TPC-C [27]) that are considered representative of front-end and back-end server segments. Figure 1 illustrates the types of incoming/outgoing network traffic for these workloads.

Figure 1. Networking Aspect of Server Workloads

SPECweb99 [25] is a benchmark that mimics the load in a web server environment. The benchmark setup uses multiple client systems to generate aggregate load (70% static and 30% dynamic) to a web server. SPECweb99 also makes use of persistent connections in order to minimize connection setups and teardowns. The latest results at the SPEC site show that current DP servers achieve 5000 to 6000 simultaneous connections. A primary requirement for SPECweb99 is that the network rate needs to be within 320 Kbps and 400 Kbps per connection. Based on measurement and tuning experiments in our labs, we find that the rate we generally observe on a DP Xeon system is 330 Kbps. Based on this estimate, Figure 2(a) shows the network bandwidth requirements for the SPECweb99 benchmark as a function of the achieved performance (number of simultaneous connections). The relationship of network rate to performance is basically linear. It should also be noted that SPECweb99's network traffic is dominant on the transmit side, with an average application buffer size (average file size) of roughly 14K bytes.

TPC-W [28] is an e-commerce benchmark that models an online book store. The benchmark setup uses multiple client systems (remote browser emulators) to generate transactions that are typically seen in an e-commerce setup consisting of multiple servers (web servers, image servers and back-end database servers). TPC-W specifies 14 different types of web interactions that require different amounts of processing on the system(s) under test (SUT). Additionally, it should be noted that each interaction requires multiple requests and responses (for the HTML frames, the images and others) between the clients, the web server, the image server and the back-end database server.

Note: the commercial benchmark experiments that we performed in our lab do not necessarily meet the strict reporting and auditing rules, nor do they fully conform to the specifications. Their use in this paper is only for understanding some behavioral characteristics, and the results should be treated as such.
The performance is measured in terms of the number of web interactions performed per second (WIPS) with some constraints on response time (specified on a per transaction basis) and the scale factor specified in terms of the number of items in the bookstore. The minimum size of the HTML code returned by the TPC-W front-end server averages around 3.3KB (ranging from slightly under 1KB to around 8KB). In addition, there are several images (thumbnails, buttons, and detailed product images) that range from 4KB to 20KB in size. TPC-W experiments in our lab (at a fixed scale factor) show that the SUT receives roughly 2.5KB of network data from the clients, while it transmits around 46KB of data to the clients. Typically, the front-end SUT configuration consists of a web server and an image server. Based on TPC-W measurement experiments, the outgoing network data is broken up into roughly 24KB per WIP transmitted by the web server and 22KB per WIP transmitted by the image server. The breakup of incoming network data from the clients is around 1.6KB to the web server and 0.9KB to the image server. In addition to the communication to the clients, the web server also communicates with the database server to query the bookstore database. In Figure 2(b), we focus on the TPC-W web server and show how the network bandwidth requirement changes with increase in performance.
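The bandwidth figures above follow from simple arithmetic on the per-connection and per-interaction traffic volumes. The short calculation below is a minimal sketch of how Figure 2 style estimates can be derived; the 330 Kbps per connection and 46KB per web interaction come from the measurements quoted above, while the connection count and the WIPS rate are illustrative inputs, not measured results.

    #include <stdio.h>

    int main(void)
    {
        /* SPECweb99: each connection is observed to sustain ~330 Kbps (Section II.A). */
        double kbps_per_conn = 330.0;
        double connections   = 6000.0;          /* example load level near the published range */
        double web_gbps = kbps_per_conn * connections / 1e6;

        /* TPC-W web + image server: ~46KB transmitted to clients per web interaction. */
        double kbytes_per_wip = 46.0;
        double wips           = 1000.0;         /* assumed interaction rate for illustration */
        double tpcw_mbps = kbytes_per_wip * 8.0 * wips / 1e3;

        printf("SPECweb99 transmit estimate: %.2f Gbps\n", web_gbps);   /* ~1.98 Gbps */
        printf("TPC-W transmit estimate:     %.0f Mbps\n", tpcw_mbps);  /* ~368 Mbps  */
        return 0;
    }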

TPC-C is an online-transaction processing (back-end) benchmark that is based on fulfilling database transactions in an order-entry environment. The purpose of using TPC-C here is to study the potential networking requirement for back-end servers that access storage blocks over TCP/IP. The latest published results for TPC-C show that current day 4-processor systems may achieve up to 70,000 to 80,000 tpmC. Based on measurements in our labs, we have estimated the number of storage I/O accesses per transaction as a function of the overall throughput. Assuming the storage access is done over IP, Figure 2(c) shows the potential networking requirements as a function of performance (specified in transactions per minute, tpmC). It should be noted that the storage I/Os per transaction depend not only on the performance level but also on the memory capacity available on the platform. We should also note that the application buffer size received is basically a physical page.

Figure 2. Network Bandwidth Estimates: (a) SPECweb99-based, (b) TPC-W-based and (c) TPC-C-based estimates.

B. Overview of TCP/IP Communication

Having looked at the networking requirements of commercial server workloads, we now turn our attention to the TCP/IP communication protocol since it is essentially the protocol of choice for most of these environments. In this subsection, we look at the components of TCP/IP processing and the modes of operation available on current network cards.

TCP/IP processing can be generally divided into two parts: (1) connection management and (2) data path processing. Connection management involves setting up and tearing down the connections. The TCP/IP data path processing consists of transmit and receive paths, both of which can be quite different in their processing requirements as we will show below. Our focus in this paper is on the data path processing characteristics of the TCP/IP stack. For a detailed discussion of the TCP/IP protocol, please refer to [19].

Figure 3 shows the architecture of the Microsoft Windows TCP/IP stack. User-level applications communicate with the Windows sockets layer, which in turn communicates with the Transport Device Interface (TDI). TDI is a proprietary interface on top of the TCP/IP protocol driver that provides a consistent interface by abstracting the layers above from various possible versions of the TCP/IP protocol. The Network Device Interface Specification (NDIS) provides a communication channel for the TCP/IP protocol driver to talk to the network interface devices. It also provides a framework and a driver model for different types of network interfaces. For more information on Microsoft's implementation of the TCP/IP stack, please refer to [8].

Figure 3. Microsoft Windows TCP/IP Stack

Next, we describe the TCP/IP transmit and receive paths in some more detail. The steps described for the transmit and receive paths are general and do not necessarily represent a particular stack implementation.

The operations required in the transmit path are illustrated in Figure 4. When the application needs to transfer data, it passes a buffer containing data to the TCP/IP stack. The stack updates the TCP Control Block (TCB) associated with the socket. If the socket send buffer is set to a value greater than zero and if it has space available, then the stack will copy the application's data into this buffer. If it is set to zero, then the stack will attempt to transmit the data directly from the application's buffer. In either case, the stack will hold on to the buffer with data until it gets an acknowledgement from the receiver indicating that the receiver has received the data without any problem. The send-side TCP/IP stack keeps track of the receiver window size, which indicates how much data the receiver is willing to accept at once. For small application buffer sizes (less than the Ethernet frame size), this buffering is required to be turned on to get any reasonable throughput because it allows the stack to coalesce the application's data while waiting for acknowledgement of previously transmitted data. This reduces the number of small packets on the wire and hence the per-packet overhead. If the application buffer is larger than the Maximum Segment Size (MSS) of the path, then either the stack or the NIC (if LSO is on) will segment the large buffer so that it does not get refused by a router on the path. For information on how the MSS is calculated, please refer to the TCP/IP RFC [19]. The stack computes the headers for the segments and passes them to the Ethernet driver, which appends the Ethernet header to the TCP/IP header before transmitting the data and headers to the NIC using DMA.

Figure 4. Details of the TCP/IP Transmit Path
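The socket send buffer behavior described above is the one application-visible knob in this path. The fragment below is a minimal Winsock sketch of setting it; the buffer sizes an application might pass are arbitrary examples, and the internal copy, coalescing and retransmit behavior remains up to the stack as described in the text.

    #include <winsock2.h>
    #include <stdio.h>

    /* Configure the per-socket send buffering discussed above.
     * snd_bytes > 0 : the stack may copy application data into its own
     *                 buffer and coalesce small writes (needed for small buffers).
     * snd_bytes == 0: the stack transmits directly from the application's
     *                 buffer, which must then stay valid until acknowledged. */
    static int set_send_buffer(SOCKET s, int snd_bytes)
    {
        if (setsockopt(s, SOL_SOCKET, SO_SNDBUF,
                       (const char *)&snd_bytes, sizeof(snd_bytes)) != 0) {
            fprintf(stderr, "setsockopt(SO_SNDBUF) failed: %d\n", WSAGetLastError());
            return -1;
        }
        return 0;
    }

The receive-side socket buffer can be sized the same way with SO_RCVBUF; these are the send and receive buffer settings referred to later in Section III.D.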

In the case of receive, the NIC receives data, updates the receive descriptors supplied by the NIC driver with the checksum and other information, and copies the descriptors and data into memory (using DMA). The NIC will interrupt the processor after the DMA is done. Typically, NICs support a feature where they interrupt the processor either after accumulating a certain number of packets or after a certain time has elapsed since the last interrupt. This feature reduces the number of interrupts to the processor, which are expensive in terms of computing power. This receive path is illustrated in Figure 5. The registered NIC device driver receives the interrupt and passes the buffer containing the data to the TCP/IP stack. The stack reads the TCP and IP header information from the buffer and updates the TCB associated with the socket for which this data was received. Based on the implementation and the receive TCP window size, the stack determines when to send the acknowledgement for the received data. The data is then either copied into the socket receive buffer associated with that connection or placed directly into the application buffer if available.

Figure 5. Details of the TCP/IP Receive Path

III. MEASUREMENT-BASED CHARACTERIZATION

In this section, we describe the methodology and tools used to evaluate TCP/IP packet processing characteristics and performance.

A. Overview of NTttcp

NTttcp is a Microsoft command-line sockets tool based on the TTCP benchmark [26]. These tools are routinely used for measuring TCP and UDP performance between two end systems. When run, this tool outputs the achieved throughput, CPU utilization, number of interrupts generated by the NIC, packet size and other values. The NTttcp transmit side achieves high network throughput by filling an application buffer with data and then repeatedly transmitting this data to a receiving node. Since it is largely running from memory, NTttcp thereby enables a traffic transmitter and a receiver that operate at true network speeds.
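A minimal sockets transmit loop in the spirit of the NTttcp behavior just described might look like the sketch below. It is not NTttcp itself: the socket is assumed to be already connected, the buffer size and iteration count are caller-supplied placeholders, and a real measurement tool would add timing, multiple streams and overlapped I/O.

    #include <winsock2.h>
    #include <string.h>
    #include <stdlib.h>

    /* Repeatedly transmit one pre-filled application buffer, as the NTttcp
     * transmit side does, so the sender runs from memory at network speed. */
    static void blast(SOCKET s, size_t buf_bytes, unsigned long iterations)
    {
        char *buf = (char *)malloc(buf_bytes);
        if (buf == NULL)
            return;
        memset(buf, 'A', buf_bytes);                 /* fill once, reuse for every send */

        for (unsigned long i = 0; i < iterations; i++) {
            size_t off = 0;
            while (off < buf_bytes) {                /* send() may accept less than asked */
                int n = send(s, buf + off, (int)(buf_bytes - off), 0);
                if (n <= 0)
                    goto out;                        /* error or connection closed */
                off += (size_t)n;
            }
        }
    out:
        free(buf);
    }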

NTttcp also provides command line parameters to enable various available stack features, as will be described in a subsequent section on TCP/IP modes of operation. It should be noted that NTttcp only stresses the data transfer portion of TCP/IP connections and does not cover connection management (setups and teardowns). The analysis of the processing requirements of connection management is not within the scope of this paper; we are planning to study this in the future using a traffic generator called GEIST [15].

B. Overview of System Configurations

Our measurement configuration consists of the following systems: (1) the system under test (SUT) and (2) two clients (used as source or sink during receive or transmit experiments). This configuration is illustrated in Figure 6. All three platforms run Microsoft Windows Server 2003 Enterprise Edition (RC2 version). The system under test consists of a Pentium M processor running at 1.6 GHz with a 400 MHz Front Side Bus (FSB), the E7501 chipset with DDR-200 SDRAM memory, two PCI-X slots (133 MHz and 100 MHz) and two dual-ported 1 Gbit Intel PRO NICs. Two 4-way Intel Itanium 2 machines are used as the clients so that they do not become the bottlenecks when receiving or transmitting the data. The SUT is connected to each client over the dual-ported NICs. This gives a total bandwidth of 4 Gbps to the SUT. We measured the network throughput when transmitting or receiving buffer sizes ranging from 64 bytes to 65536 bytes over sixteen NTttcp streams using a single Pentium M processor. For the transmit test, the transmit version of NTttcp was run on the system under test and the receive version was run on the sink servers. For the receive test, the configuration is the exact opposite. It is always ensured that the system is at steady state before any measurements are taken.

Figure 6. Measurement Configurations (Pentium M UP platform connected to two source/sink servers)

C. Performance Monitoring Tools

The NTttcp tool provides the achieved throughput (Mbits/s), the CPU utilization and several other performance data such as packets sent/received, interrupts per second and cycles per byte. For a detailed characterization of the architectural components, we used the in-built performance monitoring events in the Pentium M processor. The event monitoring hardware provides several facilities including simple event counting, time-based sampling, event sampling and branch tracing, which are extracted by an Intel-developed tool called EMON. A detailed explanation of these techniques is not within the scope of this paper. Some of the key processor events that were used in our performance analysis include the number of retired instructions, the unhalted CPU cycles, the number of branches and the number of cache and TLB misses. Each event was sampled for a period of 10 seconds. We made sure that each test lasted long enough to allow data collection of the monitored events.
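The architectural metrics reported later (path length and CPI) are simple ratios of these event counts. The sketch below shows that arithmetic under the assumption that instructions retired, unhalted cycles and the number of buffers transferred are collected over the same sampling interval; the numeric inputs are placeholders, not measured values.

    #include <stdio.h>

    int main(void)
    {
        /* Event counts over one sampling interval (placeholder values). */
        double instructions_retired = 4.2e9;
        double unhalted_cycles      = 6.3e9;
        double buffers_transferred  = 1.0e6;

        double path_length = instructions_retired / buffers_transferred; /* instructions per buffer */
        double cpi         = unhalted_cycles / instructions_retired;     /* cycles per instruction  */
        double cycles_per_buffer = path_length * cpi;

        printf("PL  = %.0f instructions/buffer\n", path_length);
        printf("CPI = %.2f cycles/instruction\n", cpi);
        printf("    = %.0f cycles/buffer\n", cycles_per_buffer);
        return 0;
    }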
D. Measured Modes of TCP/IP Operation

There are several settings available to tune the TCP/IP stack to optimize performance for a given workload. These settings typically fall into two camps: 1) available NIC features and 2) available stack features.

Stack Features:

* Overlapped I/O with completion ports: This is the preferred way of performing network I/O in the Microsoft Windows OS for applications that need to support a large number of connections. In this mode, applications post one or more buffers to send and receive data from the network and wait for completion messages to arrive in the completion port. Completion ports are an efficient mechanism on Microsoft Windows for sending notifications to applications on the status of their I/O operations. The OS locks the memory pages associated with these posted buffers so that they do not get swapped out. This allows TCP/IP to copy the data directly from the application buffer to the NIC using DMA for transmits, and into the application buffer from the NIC buffer for receives, thus avoiding an extra memory copy (a minimal sketch of this mechanism is given after these stack features).

* Send/Receive Buffers: The send and receive buffers dictate the buffer management of the TCP/IP stack. Every TCP/IP stack uses some default values for these parameters, but applications can override these values for each socket. These parameters specify the amount of buffer space to be used by the TCP/IP stack for each socket to temporarily store the data. If the send or receive buffer is set to zero, then the application has to make sure that it provides buffers fast enough to get the maximum throughput.
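The following fragment is a minimal sketch of the overlapped I/O pattern described in the first bullet above, using the standard Winsock completion-port calls (CreateIoCompletionPort, WSARecv, GetQueuedCompletionStatus). Error handling, connection setup and buffer recycling are omitted, and the 64KB posted-buffer size is only an example.

    #include <winsock2.h>
    #include <windows.h>
    #include <string.h>

    #define RECV_BYTES (64 * 1024)   /* example posted-buffer size */

    /* Post one overlapped receive on 's' and wait for its completion on an
     * I/O completion port, as in the overlapped I/O mode described above. */
    static void one_overlapped_receive(SOCKET s)
    {
        static char buf[RECV_BYTES];

        /* Create a completion port and associate the socket with it. */
        HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
        CreateIoCompletionPort((HANDLE)s, iocp, (ULONG_PTR)s, 0);

        /* Post the application buffer; the OS locks its pages so received
         * data can be placed directly into it. */
        WSABUF wsa_buf = { RECV_BYTES, buf };
        WSAOVERLAPPED ov;
        memset(&ov, 0, sizeof(ov));
        DWORD flags = 0;
        WSARecv(s, &wsa_buf, 1, NULL, &flags, &ov, NULL);

        /* Block until the receive completes and is queued to the port. */
        DWORD bytes = 0;
        ULONG_PTR key = 0;
        LPOVERLAPPED done = NULL;
        GetQueuedCompletionStatus(iocp, &bytes, &key, &done, INFINITE);

        CloseHandle(iocp);
    }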

NIC Features:

* Large Segment Offload (LSO): This feature allows the TCP/IP stack to send large segments (greater than the MTU) of data to the NIC and lets the NIC break these large segments into MTU-sized segments and compute the TCP/IP header information for them. The TCP/IP stack computes a single header for the entire segment and passes that to the NIC along with the data. The NIC uses this header information to compute the headers for the smaller segments. Obviously, this feature is helpful only on the transmit side and only for application buffer sizes greater than the MTU.

* Jumbo Frames: This feature allows the NIC to increase its maximum Ethernet frame size from 1514 to a higher value (4088, 9014, etc.). It should be enabled on both ends of the network as well as on all the intermediate nodes for it to work, which is why Jumbo frames are not common on the Internet. But this can be used in autonomous networks, say in an enterprise data center, where it can be enabled between some systems. Applications such as storage and file servers, which typically transfer large packets, can benefit from this feature (a short calculation of the effect of frame size on the number of wire segments per buffer is given after Section IV.A below).

Table 1 summarizes the configuration parameters of our test setup as described earlier.

Table 1. Configuration Parameters for Measurements (NTttcp settings: number of streams (connections) per client; application buffer sizes from 64 bytes to 64K bytes in multiples of 2; overlapped I/O with 2 completion ports; send and receive socket buffer sizes)

IV. ANALYSIS AND CHARACTERIZATION OF TCP/IP PACKET PROCESSING

In this section, we first provide an overview of the TCP/IP performance in terms of the throughput and CPU utilization, as measured by the NTttcp tool. We then characterize the TCP/IP processing by measuring and analyzing the information from various event counters.

Figure 7. TCP/IP Performance Analysis: (a) Transmit Throughput and Utilization; (b) Receive Throughput and Utilization.

A. Overview of TCP/IP Performance

Figure 7(a) shows the observed transmit-side throughput and CPU utilization for both the regular and the Jumbo Ethernet frame sizes. In the case of Jumbo frames, we were able to achieve the 4 Gbps peak line rate for application buffer sizes of 16K bytes and higher. Except for the 64, 4K and 8K byte application buffer sizes, higher throughput is achieved with the same or lower CPU utilization compared to the regular Ethernet frame case. In the case of Jumbo frames, the socket send buffer is set to 64KB for application buffer sizes less than 16KB, so an extra data copy occurs for these. The impact of this copy seems to offset the benefit of Jumbo frames for the 4K and 8K buffer sizes, and hence the lower throughput. This is confirmed by the observed higher path length (instructions/bit) numbers over the regular Ethernet frame case for these three application buffer sizes. Figure 7(b) shows the receive-side throughput and corresponding CPU utilization for regular and Jumbo Ethernet frames. For the most part, the use of Jumbo frames has yielded higher throughput, by up to 22% (at the 2K byte application buffer). Most of this benefit is due to the higher number of bytes received per interrupt.
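The throughput differences between regular and Jumbo frames are easier to see in terms of how many wire segments (and hence per-segment header computations, descriptors and, on receive, interrupts) one application buffer turns into. The sketch below does that arithmetic; the MSS values are nominal figures for 1514-byte and 9014-byte frames assuming standard 40-byte TCP/IP headers, not measured parameters of our setup.

    #include <stdio.h>

    /* Number of TCP segments needed to carry 'buf_bytes' of application data
     * when each segment carries at most 'mss' payload bytes. */
    static unsigned long segments(unsigned long buf_bytes, unsigned long mss)
    {
        return (buf_bytes + mss - 1) / mss;   /* ceiling division */
    }

    int main(void)
    {
        unsigned long buf = 64 * 1024;        /* 64KB application buffer */
        unsigned long mss_regular = 1460;     /* 1514-byte frame: 1500 MTU - 40B TCP/IP headers */
        unsigned long mss_jumbo   = 8960;     /* 9014-byte frame: 9000 MTU - 40B TCP/IP headers */

        printf("64KB buffer -> %lu segments with regular frames\n", segments(buf, mss_regular)); /* 45 */
        printf("64KB buffer -> %lu segments with jumbo frames\n",   segments(buf, mss_jumbo));   /* 8  */
        return 0;
    }

With LSO enabled, the stack hands the NIC the whole buffer and a single template header, so the per-segment header computation moves off the host even though the same number of frames appears on the wire.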

B. Bits per Hertz

We wanted to address the generally held notion of 1 bps per 1 Hz, so we have plotted the bits (transmitted or received) per Hertz in Figure 8. We find that this rule of thumb may only be applicable for transmitting (TX) bulk data over a pre-established connection. We actually observe that when using regular frames and transmitting 32KB buffers, we may be able to achieve up to 5 bps for every Hz. For the receive side (RX), the situation seems dismal: at a 32KB buffer size, we are hardly able to achieve 1 bps for every Hz. For large buffer sizes, the use of Jumbo Ethernet frames further improves the achievable bps/Hz. For example, at a 32KB buffer size, the transmit rate increases from ~5 bps to ~8 bps per Hz. The change for the receive side is not as significant (from 0.8 bps to 1 bps per Hz). For small buffer sizes (TX and RX), we achieve much less than 1 bps/Hz (~0.06 bps/Hz for 64 byte buffers and ~0.3 bps/Hz for 512 byte buffers). Here Jumbo frames do not help, for obvious reasons. Overall, it should be kept in mind that the CPU frequency required for achieving a line rate depends not only on the processor architecture family but also on the bandwidth and latency that the platform provides and on the efficiency of the protocol stack implementation (number of instructions needed in the execution path, etc.). We start to delve into these aspects much deeper in the next few sections. For this we just focus on the regular Ethernet frame case.

C. CPI and Path Length Characteristics

We start presenting the architectural characteristics for TCP/IP data path processing by looking at the number of instructions executed and retired for each application buffer transferred or received. This data (termed path length or PL) was collected using performance counters on the Pentium M processor and is shown in Figure 9. From the figure it is apparent that as the buffer size increases, the number of instructions needed (path length) also increases significantly. A comparison of TX and RX path lengths reveals that the receive path requires significantly more instructions (almost 5X) at large buffer sizes. This explains the larger throughput numbers for transmit over receive. At small buffer sizes, however, the path lengths for TX and RX are quite similar as well as much lower (for example, at 256 byte buffer sizes with regular frames, it requires roughly 4200 instructions to receive a buffer and 4600 instructions to transmit a buffer).

Figure 9. Path Length of TCP/IP Data Transfer

The next step is to understand the number of CPU cycles needed to issue, execute and retire an instruction (denoted as CPI). The CPI numbers for various TCP/IP buffer sizes are shown in Figure 10. The first observation is that the TX CPI is relatively flat compared to the RX CPI, which increases considerably with increase in buffer size. The increase in RX CPI with buffer size is due to the fact that RX processing involves at least one memory copy of the received data, which causes cache misses and makes it more memory dependent. The observed dip in the RX CPI for the 2KB, 4KB and 32KB buffer sizes is due to a reduction in the number of L1 and L2 misses for these buffer sizes. We are further investigating the cause of the lower cache misses for these three buffer sizes.

Figure 10. System-Level CPI
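The bits-per-Hz numbers of Section IV.B and the path length and CPI data above are tied together by a simple identity: the cycles spent per buffer are PL x CPI, and bits per Hz is the buffer size in bits divided by that product. The small calculation below illustrates the relationship; the ~5 bps/Hz input is taken from the 32KB transmit case above, and the implied cycle count is a back-of-envelope figure rather than a separately measured breakdown into PL and CPI.

    #include <stdio.h>

    int main(void)
    {
        /* Identity: bits per Hz = (8 * buffer_bytes) / (PL * CPI),
         * since PL * CPI is the number of CPU cycles spent per buffer. */
        double buffer_bytes = 32.0 * 1024.0;     /* 32KB transmit case from Figure 8 */
        double bits_per_hz  = 5.0;               /* ~5 bps/Hz observed with regular frames */

        double cycles_per_buffer = (8.0 * buffer_bytes) / bits_per_hz;
        printf("Implied cost: ~%.0f cycles per 32KB transmit\n", cycles_per_buffer); /* ~52K cycles */

        /* At the 1.6 GHz clock of the SUT this corresponds to the per-buffer time budget. */
        double seconds = cycles_per_buffer / 1.6e9;
        printf("            = ~%.1f microseconds per buffer at 1.6 GHz\n", seconds * 1e6);
        return 0;
    }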
D. Micro-architectural Characteristics

In addition to CPI and path length, we have also measured the micro-architectural events to better understand the behavior of TCP/IP transmit and receive processing. The micro-architectural events include branch frequency and misprediction, cache misses at various levels in the hierarchy and TLB misses. The detailed data (on a per-buffer basis) is shown in Tables 2 and 3. It is apparent from the tables that the values for each event increase substantially with increase in buffer size. We also note that the percentage of branches (with respect to overall instructions) is roughly 20% irrespective of buffer size and for both transmit and receive processing. When comparing receive-side processing to transmit-side processing, we find that the number of memory references is significantly higher on the receive side. Overall, we also find that as the buffer size increases, the rise in cache misses, TLB misses and branch mispredicts per instruction is more rapid for the receive side and more gradual for the transmit side.

Table 2. Transmit-Side Processing Characteristics

Table 3. Receive-Side Processing Characteristics

Figure 11. TCP/IP (Fast-Path) Packet Processing in Server Workloads

V. IMPACT ON COMMERCIAL SERVER WORKLOADS

In the previous section, we discussed the overall processing requirements of the TCP/IP fast path (data transfers). In this section, we relate these processing requirements to commercial server workloads based on measurements done in our labs. The average path length per operation or transaction for SPECweb99, TPC-W and TPC-C, including its TCP/IP data path component, is illustrated in Figure 11. Measurements and modeling experiments of the SPECweb99 and TPC-W benchmarks in our lab have revealed the overall application path lengths per operation for these workloads, and roughly 1.3 million instructions per transaction (if the entire data set is cached in memory) for TPC-C. In the case of SPECweb99, the receive side per operation only receives an HTTP request, which fits easily within 128 bytes. We have shown earlier in the paper that receiving 128 byte buffers requires around 4000 instructions. The transmit side for SPECweb99 sends an average file of roughly 14KB; to send this, it requires approximately 14,000 instructions. As a result, out of the overall instructions per operation, roughly 18,000 instructions (~28%) are related to packet processing. Similarly for TPC-W, based on the number of messages communicated between the web server and the other systems (clients and database servers), the estimated message sizes and the number of instructions required for processing these by the TCP/IP stack (using NTttcp), we estimated the number of instructions required for the packet processing alone. Of this, 64% (~62,000 instructions) is transmit-side and 36% is receive-side. Also, the majority of the communication (84%) is due to the communication between the clients and the servers. Overall, we find that almost 29% of the instructions executed by TPC-W web servers are related to packet processing. In the case of TPC-C, without any disk I/O accesses, we noted above that the application path length is roughly 1.3 million instructions. Each storage I/O operation (over IP) will transmit or receive a physical page. Assuming an average of 25 storage I/O accesses per transaction, the number of instructions attributed to packet processing works out to just below 700,000 instructions. As a result, the overall path length per transaction (with storage over IP) can be estimated at roughly 2 million instructions, of which the packet processing component is 35%.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we presented a detailed analysis of TCP/IP fast path performance. We accomplished this through detailed measurements on a state-of-the-art low power Intel Pentium M microprocessor. We also used the performance monitoring events on the Pentium M to understand the architectural characteristics of TCP/IP packet processing. We first looked at the achievable performance and associated CPU utilization for transmit and receive processing from 64 byte to 64KB buffer sizes. We then evaluated architectural characteristics by studying CPI, path length, cache performance, TLB performance and branch misprediction rates. We showed that transmit performance is largely compute bound, whereas receive throughput can be memory bound.
We then related the path length required for packet processing to the overall path length required for each transaction (or operation) in commercial server workloads (SPECweb99, TPC-W and TPC-C). We showed that roughly 35% of TPC-C, ~28% of SPECweb99 and ~29% of TPC-W workload execution can constitute TCP/IP packet processing.
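These percentages follow directly from the per-buffer path lengths of Section IV and the per-transaction I/O counts of Section V. The sketch below redoes the TPC-C portion of that arithmetic; the ~28,000 instructions assumed per physical-page transfer is an illustrative figure chosen so that 25 I/Os land near the ~700,000-instruction estimate quoted above, not a separately measured value.

    #include <stdio.h>

    int main(void)
    {
        /* TPC-C estimate from Section V: base transaction path length plus
         * 25 storage I/Os per transaction, each moving one physical page over IP. */
        double base_instructions   = 1.3e6;   /* TPC-C path length with data set cached     */
        double ios_per_transaction = 25.0;
        double instr_per_page_io   = 28000.0; /* assumed per-page TCP/IP cost (~700K / 25)  */

        double packet_instr = ios_per_transaction * instr_per_page_io;   /* ~700,000      */
        double total_instr  = base_instructions + packet_instr;          /* ~2.0 million  */

        printf("Packet processing share: %.0f%%\n", 100.0 * packet_instr / total_instr); /* ~35 percent */
        return 0;
    }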

Future work in this area is multi-fold. Several potential areas of architectural optimization that may help accelerate TCP/IP packet processing need to be investigated; these include network-aware data prefetching and forwarding, protocol stack (software) optimizations and determining the useful architectural features of packet processing engines. We also plan to study the performance characteristics of connection management (set-ups and teardowns). We plan to accomplish this by modifying a web traffic generator that we built called GEIST [15]. Finally, we would also like to accurately quantify the amount of packet processing in several other interesting applications (mid-tier applications and IPC-intensive applications).

ACKNOWLEDGEMENTS

We would like to express our thanks to Michael Espig for providing us the necessary system infrastructure to perform these measurements. We would also like to thank Dave Minturn and Ramesh Illikkal for their insight into TCP/IP protocol stacks, and the other members of our team for their helpful input on this study. Finally, we also thank Raed Kanjo, James Mead and Will Auld for providing us insights into the measurement and analysis of commercial workloads.

NOTICES

Intel, Pentium and Itanium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others.

REFERENCES

[1] Alacritech SLIC: A Data Path TCP Offload Methodology.
[2] J. Chase et al., "End System Optimizations for High-Speed TCP," IEEE Communications, Special Issue on High-Speed TCP, June.
[3] D. Clark et al., "An Analysis of TCP Processing Overhead," IEEE Communications, June.
[4] D. Clark et al., "Architectural Considerations for a New Generation of Protocols," ACM SIGCOMM, Sept.
[5] A. Earls, "TCP Offload Engines Finally Arrive," Storage Magazine, March.
[6] A. Foong et al., "TCP Performance Analysis Re-visited," IEEE International Symposium on Performance Analysis of Software and Systems, March.
[7] S. Gochman et al., "The Intel Pentium M Processor: Microarchitecture and Performance," Intel Technology Journal, May.
[8] "High Performance Network Adapters and Drivers in Windows," wdec/tech/, December.
[9] Y. Hoskote et al., "A 10GHz TCP Offload Accelerator for 10Gb/s Ethernet in 90nm Dual-Vt CMOS," IEEE International Solid-State Circuits Conference (ISSCC), San Francisco.
[10] R. Huggahalli and R. Iyer, "Direct Cache Access for Coherent Network I/O," submitted to an international conference.
[11] iSCSI, IP Storage Working Group, Internet Draft, available at 20.txt.
[12] R. Iyer, "CASPER: Cache Architecture Simulation and Performance Exploration using Refstreams," Intel Design and Test Technology Conference (DITC), July.
[13] R. Iyer, "On Modeling and Analyzing Cache Hierarchies using CASPER," 11th Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, Oct.
[14] K. Kant, "TCP Offload Performance for Front-End Servers," to appear in Globecom, San Francisco.
[15] K. Kant, V. Tewari and R. Iyer, "GEIST: A Generator for E-commerce and Internet Server Traffic," 2001 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Oct 2001.
[16] H. Kim, V. Pai and S. Rixner, "Increasing Web Server Throughput with Network Interface Data Caching," International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), San Jose, CA, October 2002.
[17] D. McConnell, "IP Storage: A Technology Overview," available at ipstorage.htm.
[18] J. Mogul, "TCP Offload Is a Dumb Idea Whose Time Has Come," Symposium on Hot Operating Systems (HotOS), Stanford.
[19] J. Postel, "Transmission Control Protocol," RFC 793, Information Sciences Institute, Sept 1981.
[20] M. Rangarajan et al., "TCP Servers: Offloading TCP/IP Processing in Internet Servers. Design, Implementation, and Performance," Rutgers University, Dept. of Computer Science Technical Report DCS-TR-481, March.
[21] G. Regnier et al., "ETA: Experience with an Intel Xeon Processor as a Packet Processing Engine," Symposium on High Performance Interconnects (Hot Interconnects), Stanford.
[22] RDMA Consortium.
[23] Remote Direct Data Placement Working Group.
[24] P. Sarkar, S. Uttamchandani and K. Voruganti, "Storage over IP: Does Hardware Support Help?" Proc. 2nd USENIX Conference on File and Storage Technologies, CA, March.
[25] SPECweb99 Design Document, available online.
[26] The TTCP Benchmark.
[27] TPC-C Design Document.
[28] TPC-W Design Document.
[29] S. Shankland, "HP to Sharpen Blade with Pentium M."


More information

Connection Handoff Policies for TCP Offload Network Interfaces

Connection Handoff Policies for TCP Offload Network Interfaces Connection Handoff Policies for TCP Offload Network Interfaces Hyong-youb Kim and Scott Rixner Rice University Houston, TX 77005 {hykim, rixner}@rice.edu Abstract This paper presents three policies for

More information

Exchange Server 2007 Performance Comparison of the Dell PowerEdge 2950 and HP Proliant DL385 G2 Servers

Exchange Server 2007 Performance Comparison of the Dell PowerEdge 2950 and HP Proliant DL385 G2 Servers Exchange Server 2007 Performance Comparison of the Dell PowerEdge 2950 and HP Proliant DL385 G2 Servers By Todd Muirhead Dell Enterprise Technology Center Dell Enterprise Technology Center dell.com/techcenter

More information

Benefits of Automatic Data Tiering in OLTP Database Environments with Dell EqualLogic Hybrid Arrays

Benefits of Automatic Data Tiering in OLTP Database Environments with Dell EqualLogic Hybrid Arrays TECHNICAL REPORT: Performance Study Benefits of Automatic Data Tiering in OLTP Database Environments with Dell EqualLogic Hybrid Arrays ABSTRACT The Dell EqualLogic hybrid arrays PS6010XVS and PS6000XVS

More information

Dell Reference Configuration for Large Oracle Database Deployments on Dell EqualLogic Storage

Dell Reference Configuration for Large Oracle Database Deployments on Dell EqualLogic Storage Dell Reference Configuration for Large Oracle Database Deployments on Dell EqualLogic Storage Database Solutions Engineering By Raghunatha M, Ravi Ramappa Dell Product Group October 2009 Executive Summary

More information

The QLogic 8200 Series is the Adapter of Choice for Converged Data Centers

The QLogic 8200 Series is the Adapter of Choice for Converged Data Centers The QLogic 82 Series is the Adapter of QLogic 1GbE Converged Network Adapter Outperforms Alternatives in Dell 12G Servers QLogic 82 Series Converged Network Adapter outperforms the alternative adapter

More information

Daniel A. Menascé, Ph. D. Dept. of Computer Science George Mason University

Daniel A. Menascé, Ph. D. Dept. of Computer Science George Mason University Daniel A. Menascé, Ph. D. Dept. of Computer Science George Mason University menasce@cs.gmu.edu www.cs.gmu.edu/faculty/menasce.html D. Menascé. All Rights Reserved. 1 Benchmark System Under Test (SUT) SPEC

More information

PCI Express x8 Single Port SFP+ 10 Gigabit Server Adapter (Intel 82599ES Based) Single-Port 10 Gigabit SFP+ Ethernet Server Adapters Provide Ultimate

PCI Express x8 Single Port SFP+ 10 Gigabit Server Adapter (Intel 82599ES Based) Single-Port 10 Gigabit SFP+ Ethernet Server Adapters Provide Ultimate NIC-PCIE-1SFP+-PLU PCI Express x8 Single Port SFP+ 10 Gigabit Server Adapter (Intel 82599ES Based) Single-Port 10 Gigabit SFP+ Ethernet Server Adapters Provide Ultimate Flexibility and Scalability in Virtual

More information

The Future of High-Performance Networking (The 5?, 10?, 15? Year Outlook)

The Future of High-Performance Networking (The 5?, 10?, 15? Year Outlook) Workshop on New Visions for Large-Scale Networks: Research & Applications Vienna, VA, USA, March 12-14, 2001 The Future of High-Performance Networking (The 5?, 10?, 15? Year Outlook) Wu-chun Feng feng@lanl.gov

More information

TCP offload engines for high-speed data processing

TCP offload engines for high-speed data processing TCP offload engines for high-speed data processing TCP/IP over ethernet has become the most dominant packet processing protocol. Ethernet networks are now running at higher and higher speeds with the development

More information

Microsoft SQL Server in a VMware Environment on Dell PowerEdge R810 Servers and Dell EqualLogic Storage

Microsoft SQL Server in a VMware Environment on Dell PowerEdge R810 Servers and Dell EqualLogic Storage Microsoft SQL Server in a VMware Environment on Dell PowerEdge R810 Servers and Dell EqualLogic Storage A Dell Technical White Paper Dell Database Engineering Solutions Anthony Fernandez April 2010 THIS

More information

Evaluation of the Chelsio T580-CR iscsi Offload adapter

Evaluation of the Chelsio T580-CR iscsi Offload adapter October 2016 Evaluation of the Chelsio T580-CR iscsi iscsi Offload makes a difference Executive Summary As application processing demands increase and the amount of data continues to grow, getting this

More information

RoCE vs. iwarp Competitive Analysis

RoCE vs. iwarp Competitive Analysis WHITE PAPER February 217 RoCE vs. iwarp Competitive Analysis Executive Summary...1 RoCE s Advantages over iwarp...1 Performance and Benchmark Examples...3 Best Performance for Virtualization...5 Summary...6

More information

Performance Analysis in the Real World of Online Services

Performance Analysis in the Real World of Online Services Performance Analysis in the Real World of Online Services Dileep Bhandarkar, Ph. D. Distinguished Engineer 2009 IEEE International Symposium on Performance Analysis of Systems and Software My Background:

More information

COSC 6385 Computer Architecture - Memory Hierarchies (III)

COSC 6385 Computer Architecture - Memory Hierarchies (III) COSC 6385 Computer Architecture - Memory Hierarchies (III) Edgar Gabriel Spring 2014 Memory Technology Performance metrics Latency problems handled through caches Bandwidth main concern for main memory

More information

Introduction to OpenOnload Building Application Transparency and Protocol Conformance into Application Acceleration Middleware

Introduction to OpenOnload Building Application Transparency and Protocol Conformance into Application Acceleration Middleware White Paper Introduction to OpenOnload Building Application Transparency and Protocol Conformance into Application Acceleration Middleware Steve Pope, PhD Chief Technical Officer Solarflare Communications

More information

NIC TEAMING IEEE 802.3ad

NIC TEAMING IEEE 802.3ad WHITE PAPER NIC TEAMING IEEE 802.3ad NIC Teaming IEEE 802.3ad Summary This tech note describes the NIC (Network Interface Card) teaming capabilities of VMware ESX Server 2 including its benefits, performance

More information

Using Synology SSD Technology to Enhance System Performance Synology Inc.

Using Synology SSD Technology to Enhance System Performance Synology Inc. Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_WP_ 20121112 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges... 3 SSD

More information

Enabling Efficient and Scalable Zero-Trust Security

Enabling Efficient and Scalable Zero-Trust Security WHITE PAPER Enabling Efficient and Scalable Zero-Trust Security FOR CLOUD DATA CENTERS WITH AGILIO SMARTNICS THE NEED FOR ZERO-TRUST SECURITY The rapid evolution of cloud-based data centers to support

More information

Performance Enhancement for IPsec Processing on Multi-Core Systems

Performance Enhancement for IPsec Processing on Multi-Core Systems Performance Enhancement for IPsec Processing on Multi-Core Systems Sandeep Malik Freescale Semiconductor India Pvt. Ltd IDC Noida, India Ravi Malhotra Freescale Semiconductor India Pvt. Ltd IDC Noida,

More information

An FPGA-Based Optical IOH Architecture for Embedded System

An FPGA-Based Optical IOH Architecture for Embedded System An FPGA-Based Optical IOH Architecture for Embedded System Saravana.S Assistant Professor, Bharath University, Chennai 600073, India Abstract Data traffic has tremendously increased and is still increasing

More information

Performance Tuning VTune Performance Analyzer

Performance Tuning VTune Performance Analyzer Performance Tuning VTune Performance Analyzer Paul Petersen, Intel Sept 9, 2005 Copyright 2005 Intel Corporation Performance Tuning Overview Methodology Benchmarking Timing VTune Counter Monitor Call Graph

More information

Linux Network Tuning Guide for AMD EPYC Processor Based Servers

Linux Network Tuning Guide for AMD EPYC Processor Based Servers Linux Network Tuning Guide for AMD EPYC Processor Application Note Publication # 56224 Revision: 1.10 Issue Date: May 2018 Advanced Micro Devices 2018 Advanced Micro Devices, Inc. All rights reserved.

More information

Advanced RDMA-based Admission Control for Modern Data-Centers

Advanced RDMA-based Admission Control for Modern Data-Centers Advanced RDMA-based Admission Control for Modern Data-Centers Ping Lai Sundeep Narravula Karthikeyan Vaidyanathan Dhabaleswar. K. Panda Computer Science & Engineering Department Ohio State University Outline

More information

InfiniBand SDR, DDR, and QDR Technology Guide

InfiniBand SDR, DDR, and QDR Technology Guide White Paper InfiniBand SDR, DDR, and QDR Technology Guide The InfiniBand standard supports single, double, and quadruple data rate that enables an InfiniBand link to transmit more data. This paper discusses

More information

Benefits of I/O Acceleration Technology (I/OAT) in Clusters

Benefits of I/O Acceleration Technology (I/OAT) in Clusters Benefits of I/O Acceleration Technology (I/OAT) in Clusters K. VAIDYANATHAN AND D. K. PANDA Technical Report Ohio State University (OSU-CISRC-2/7-TR13) The 27 IEEE International Symposium on Performance

More information

FlashGrid Software Enables Converged and Hyper-Converged Appliances for Oracle* RAC

FlashGrid Software Enables Converged and Hyper-Converged Appliances for Oracle* RAC white paper FlashGrid Software Intel SSD DC P3700/P3600/P3500 Topic: Hyper-converged Database/Storage FlashGrid Software Enables Converged and Hyper-Converged Appliances for Oracle* RAC Abstract FlashGrid

More information

High-Performance IP Service Node with Layer 4 to 7 Packet Processing Features

High-Performance IP Service Node with Layer 4 to 7 Packet Processing Features UDC 621.395.31:681.3 High-Performance IP Service Node with Layer 4 to 7 Packet Processing Features VTsuneo Katsuyama VAkira Hakata VMasafumi Katoh VAkira Takeyama (Manuscript received February 27, 2001)

More information

ibench: Quantifying Interference in Datacenter Applications

ibench: Quantifying Interference in Datacenter Applications ibench: Quantifying Interference in Datacenter Applications Christina Delimitrou and Christos Kozyrakis Stanford University IISWC September 23 th 2013 Executive Summary Problem: Increasing utilization

More information

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research Fast packet processing in the cloud Dániel Géhberger Ericsson Research Outline Motivation Service chains Hardware related topics, acceleration Virtualization basics Software performance and acceleration

More information

Virtualization, Xen and Denali

Virtualization, Xen and Denali Virtualization, Xen and Denali Susmit Shannigrahi November 9, 2011 Susmit Shannigrahi () Virtualization, Xen and Denali November 9, 2011 1 / 70 Introduction Virtualization is the technology to allow two

More information

QuickSpecs. HPE Ethernet 1Gb Adapters HPE ProLiant DL, ML & Apollo. Overview

QuickSpecs. HPE Ethernet 1Gb Adapters HPE ProLiant DL, ML & Apollo. Overview Overview The HPE Ethernet 1Gb adapters deliver full line-rate performance across all ports with low power consumption, providing Ethernet connectivity ideal for virtualization, security, server consolidation,

More information

Analysis of HTTP Performance

Analysis of HTTP Performance Analysis of HTTP Performance Joe Touch, John Heidemann, and Katia Obraczka USC/Information Sciences Institute June 24, 1996 Initial Release, V1.1 Abstract: We discuss the performance effects of using per-transaction

More information

PERFORMANCE IMPROVEMENT OF AN ISCSI-BASED SECURE STORAGE ACCESS

PERFORMANCE IMPROVEMENT OF AN ISCSI-BASED SECURE STORAGE ACCESS PERFORMANCE IMPROVEMENT OF AN I-BASED SECURE STORAGE ACCESS Kikuko Kamisakay, Saneyasu Yamaguchiz, Masato Oguchiy y Graduate School of Humanities and Sciences Ochanomizu University 2-1-1, Otsuka, Bunkyo-ku,

More information

NOTE: A minimum of 1 gigabyte (1 GB) of server memory is required per each NC510F adapter. HP NC510F PCIe 10 Gigabit Server Adapter

NOTE: A minimum of 1 gigabyte (1 GB) of server memory is required per each NC510F adapter. HP NC510F PCIe 10 Gigabit Server Adapter Overview The NC510F is an eight lane (x8) PCI Express (PCIe) 10 Gigabit Ethernet SR (10GBASE-SR fiber optic) network solution offering the highest bandwidth available in a ProLiant Ethernet adapter. The

More information

SPDK China Summit Ziye Yang. Senior Software Engineer. Network Platforms Group, Intel Corporation

SPDK China Summit Ziye Yang. Senior Software Engineer. Network Platforms Group, Intel Corporation SPDK China Summit 2018 Ziye Yang Senior Software Engineer Network Platforms Group, Intel Corporation Agenda SPDK programming framework Accelerated NVMe-oF via SPDK Conclusion 2 Agenda SPDK programming

More information

Appendix B. Standards-Track TCP Evaluation

Appendix B. Standards-Track TCP Evaluation 215 Appendix B Standards-Track TCP Evaluation In this appendix, I present the results of a study of standards-track TCP error recovery and queue management mechanisms. I consider standards-track TCP error

More information