Performance Characterization of TCP/IP Packet Processing in Commercial Server Workloads


Performance Characterization of TCP/IP Packet Processing in Commercial Server Workloads

Srihari Makineni and Ravi Iyer
Network Architecture Laboratory, Intel Corporation
{srihari.makineni, ravishankar.iyer}@intel.com

Abstract -- TCP/IP is the communication protocol of choice for many current and next generation server applications (web services, e-commerce, storage, etc.). As a result, the performance of these applications can be heavily dependent on efficient TCP/IP packet processing within the termination nodes. Motivated by this, our work presented in this paper focuses on analyzing the underlying architectural characteristics of TCP/IP packet processing and quantifying the computational requirements of the TCP/IP packet processing component within current server workloads. Our analysis and characterization methodology is based on in-depth measurement experiments of TCP/IP packet processing performance on Intel's state-of-the-art low-power Pentium M microprocessor running the Microsoft Windows Server 2003 operating system. We start by analyzing the impact of NIC features such as Large Segment Offload and the use of jumbo frames on TCP/IP packet processing performance. We then show that the architectural characteristics of transmit-side processing (largely compute-bound) are significantly different from receive-side processing (mostly memory-bound). Finally, we quantify the computational requirements for sending/receiving packets within commercial workloads (SPECweb99, TPC-C and TPC-W) and show that they can form a substantial component.

I. INTRODUCTION

As Internet presence and connectivity continue to increase at a rapid pace, there is a growing need to increase the available end-to-end communication bandwidth and performance. TCP/IP [19] over Ethernet is the most prevalent form of communication that exists today in various environments. Commercial web and e-commerce server benchmarks such as SPECweb99 and TPC-W illustrate the amount of communication that is needed between the clients and servers as well as among the multiple tiers of servers involved in processing the transaction. In addition, back-end database workloads like TPC-C exhibit significant storage I/O that translates into network traffic when using IP-based storage servers. One of the well-recognized issues with TCP/IP communication is the significant computational requirement of protocol stack processing at the termination nodes (clients and servers). However, there are very few architectural studies in the literature [3, 4, 6] that present details on the performance characteristics and computational requirements of TCP/IP protocol processing. Moreover, there seems to be very little information relating packet processing to such commercial server workloads. In related work, several research projects contemplate the potential of using a TCP/IP offload engine (TOE) or a packet processing engine (PPE) [1, 2, 5, 9, 14, 20, 21] to offload protocol stack processing from the host CPU. The relevance of packet processing [17, 18, 24] grows stronger as Storage over IP becomes popular with the help of working groups for iSCSI [11], RDMA [22] and DDP [23]. While these efforts provide valuable information, the study of micro-architectural characteristics and bottlenecks in stack processing is critical to help assist the architectural considerations and design choices for a PPE.
Another direction of research that is equally important is the need to identify architectural improvements to general purpose cores that could also help improve the efficiency and performance of packet processing. The overall contributions of this paper are as follows. We perform an in-depth measurement and evaluation of packet processing on the state-of-the-art low power Intel Pentium M microprocessor [7] running the Microsoft Windows Server 2003 Enterprise Edition operating system. We chose the Pentium M processor as it is gaining momentum in the low power server blade segment [29]. The primary focus of this study is to do the following: (a) to expose the performance and architectural characteristics of packet processing and (b) to quantify the computational requirements of the TCP/IP packet processing component within server workloads. In this paper, we start by describing the measurement-based methodology used to achieve these goals. We then present an in-depth analysis of the performance data (CPI, path length, cache misses, branch prediction accuracy, etc.) collected through application-level measurements and the use of in-built performance counters available on the Pentium M microprocessor. Finally, we compare the requirements of packet processing to the overall characteristics of commercial server workloads (SPECweb99, TPC-C and TPC-W) and discuss the potential opportunities for optimizing packet processing performance.

The rest of this paper is organized as follows. Section II discusses the server networking requirements and presents an overview of TCP/IP packet processing components. Section III describes the detailed measurement-based methodology that we employed. Section IV presents an overview of TCP/IP packet processing performance and delves into the architectural characteristics of both transmit-side and receive-side processing. Section V relates the packet processing requirements to commercial workload execution. Section VI summarizes our inferences and concludes with a direction towards future work in this area.

II. TCP/IP PACKET PROCESSING IN SERVER WORKLOADS

In this section, we start by discussing the networking requirements for server workloads to motivate the need to characterize TCP/IP packet processing performance. We then discuss the components of interest when evaluating TCP/IP packet processing.

A. Server Networking Requirements

We have chosen three commercial server workloads (SPECweb99 [25], TPC-W [28] and TPC-C [27]) that are considered representative of front-end and back-end server segments. Figure 1 illustrates the types of incoming/outgoing network traffic for these workloads.

Figure 1. Networking Aspect of Server Workloads

SPECweb99 [25] is a benchmark that mimics the load in a web server environment. The benchmark setup uses multiple client systems to generate aggregate load (70% static and 30% dynamic) to a web server. SPECweb99 also makes use of persistent connections in order to minimize connection setups and teardowns. The latest results at the SPEC site show that current DP servers achieve 5000 to 6000 simultaneous connections. A primary requirement for SPECweb99 is that the network rate needs to be within 320 Kbps and 400 Kbps per connection. Based on measurement and tuning experiments in our labs, we find that the rate we generally observe on a DP Xeon system is 330 Kbps. Based on this estimate, Figure 2(a) shows the network bandwidth requirements for the SPECweb99 benchmark as a function of the achieved performance (number of simultaneous connections). The relationship of network rate to performance is basically linear. It should also be noted that SPECweb99's network traffic is dominant on the transmit side, with an average application buffer size (average file size) of roughly 14K bytes.

TPC-W [28] is an e-commerce benchmark that models an online book store. The benchmark setup uses multiple client systems (remote browser emulators) to generate transactions that are typically seen in an e-commerce setup consisting of multiple servers (web servers, image servers and back-end database servers). TPC-W specifies 14 different types of web interactions that require different amounts of processing on the system(s) under test (SUT). Additionally, it should be noted that each interaction requires multiple requests and responses (for the HTML frames, the images and others) between the clients, the web server, the image server and the back-end database server.

Note: the commercial benchmark experiments that we performed in our lab do not necessarily meet the strict reporting and auditing rules, nor do they fully conform to the specifications. Their use in this paper is only for understanding some behavioral characteristics, and the results should be treated as such.
The performance is measured in terms of the number of web interactions performed per second (WIPS) with some constraints on response time (specified on a per transaction basis) and the scale factor specified in terms of the number of items in the bookstore. The minimum size of the HTML code returned by the TPC-W front-end server averages around 3.3KB (ranging from slightly under 1KB to around 8KB). In addition, there are several images (thumbnails, buttons, and detailed product images) that range from 4KB to 20KB in size. TPC-W experiments in our lab (at a fixed scale factor) show that the SUT receives roughly 2.5KB of network data from the clients, while it transmits around 46KB of data to the clients. Typically, the front-end SUT configuration consists of a web server and an image server. Based on TPC-W measurement experiments, the outgoing network data is broken up into roughly 24KB per WIP transmitted by the web server and 22KB per WIP transmitted by the image server. The breakup of incoming network data from the clients is around 1.6KB to the web server and 0.9KB to the image server. In addition to the communication to the clients, the web server also communicates with the database server to query the bookstore database. In Figure 2(b), we focus on the TPC-W web server and show how the network bandwidth requirement changes with increase in performance.
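The bandwidth figures above follow from simple arithmetic on the per-connection and per-interaction traffic volumes. The short calculation below is a minimal sketch of how Figure 2 style estimates can be derived; the 330 Kbps per connection and 46KB per web interaction come from the measurements quoted above, while the connection count and the WIPS rate are illustrative inputs, not measured results.

    #include <stdio.h>

    int main(void)
    {
        /* SPECweb99: each connection is observed to sustain ~330 Kbps (Section II.A). */
        double kbps_per_conn = 330.0;
        double connections   = 6000.0;          /* example load level near the published range */
        double web_gbps = kbps_per_conn * connections / 1e6;

        /* TPC-W web + image server: ~46KB transmitted to clients per web interaction. */
        double kbytes_per_wip = 46.0;
        double wips           = 1000.0;         /* assumed interaction rate for illustration */
        double tpcw_mbps = kbytes_per_wip * 8.0 * wips / 1e3;

        printf("SPECweb99 transmit estimate: %.2f Gbps\n", web_gbps);   /* ~1.98 Gbps */
        printf("TPC-W transmit estimate:     %.0f Mbps\n", tpcw_mbps);  /* ~368 Mbps  */
        return 0;
    }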

TPC-C is an online-transaction processing (back-end) benchmark that is based on fulfilling database transactions in an order-entry environment. The purpose of using TPC-C here is to study the potential networking requirement for back-end servers that access storage blocks over TCP/IP. The latest published results for TPC-C show that current day 4-processor systems may achieve up to 70,000 to 80,000 tpmC. Based on measurements in our labs, we have estimated the number of storage I/O accesses per transaction as a function of the overall throughput. Assuming the storage access is done over IP, Figure 2(c) shows the potential networking requirements as a function of performance (specified in transactions per minute, tpmC). It should be noted that the storage I/Os per transaction depend not only on the performance level but also on the memory capacity available on the platform. We should also note that the application buffer size received is basically a physical page.

Figure 2. Network Bandwidth Estimates: (a) SPECweb99-based, (b) TPC-W-based and (c) TPC-C-based estimates.

B. Overview of TCP/IP Communication

Having looked at the networking requirements of commercial server workloads, we now turn our attention to the TCP/IP communication protocol since it is essentially the protocol of choice for most of these environments. In this subsection, we look at the components of TCP/IP processing and the modes of operation available on current network cards.

TCP/IP processing can be generally divided into two parts: (1) connection management and (2) data path processing. Connection management involves setting up and tearing down the connections. The TCP/IP data path processing consists of transmit and receive paths, both of which can be quite different in their processing requirements as we will show below. Our focus in this paper is on the data path processing characteristics of the TCP/IP stack. For a detailed discussion of the TCP/IP protocol, please refer to [19].

Figure 3 shows the architecture of the Microsoft Windows TCP/IP stack. User-level applications communicate with the Windows sockets layer, which in turn communicates with the Transport Device Interface (TDI). TDI is a proprietary interface on top of the TCP/IP protocol driver that provides a consistent interface by abstracting the layers above from various possible versions of the TCP/IP protocol. The Network Device Interface Specification (NDIS) provides a communication channel for the TCP/IP protocol driver to talk to the network interface devices. It also provides a framework and a driver model for different types of network interfaces. For more information on Microsoft's implementation of the TCP/IP stack, please refer to [8].

Figure 3. Microsoft Windows TCP/IP Stack

Next, we describe the TCP/IP transmit and receive paths in some more detail. The steps described for the transmit and receive paths are general and do not necessarily represent a particular stack implementation.

The operations required in the transmit path are illustrated in Figure 4. When the application needs to transfer data, it passes a buffer containing data to the TCP/IP stack. The stack updates the TCP Control Block (TCB) associated with the socket. If the socket send buffer is set to a value greater than zero and if it has space available, then the stack will copy the application's data into this buffer. If it is set to zero, then the stack will attempt to transmit the data directly from the application's buffer. In either case, the stack will hold on to the buffer with data until it gets an acknowledgement from the receiver indicating that the receiver has received the data without any problem. The send-side TCP/IP stack keeps track of the receiver window size, which indicates how much data the receiver is willing to accept at once. For small application buffer sizes (less than the Ethernet frame size), this buffering is required to be turned on to get any reasonable throughput because it allows the stack to coalesce the application's data while waiting for acknowledgement of previously transmitted data. This reduces the number of small packets on the wire and hence the per-packet overhead. If the application buffer is larger than the Maximum Segment Size (MSS) of the path, then either the stack or the NIC (if LSO is on) will segment the large buffer so that it does not get refused by a router on the path. For information on how the MSS is calculated, please refer to the TCP/IP RFC [19]. The stack computes the headers for the segments and passes them to the Ethernet driver, which appends the Ethernet header to the TCP/IP header before transmitting the data and headers to the NIC using DMA.

Figure 4. Details of the TCP/IP Transmit Path
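The socket send buffer behavior described above is the one application-visible knob in this path. The fragment below is a minimal Winsock sketch of setting it; the buffer sizes an application might pass are arbitrary examples, and the internal copy, coalescing and retransmit behavior remains up to the stack as described in the text.

    #include <winsock2.h>
    #include <stdio.h>

    /* Configure the per-socket send buffering discussed above.
     * snd_bytes > 0 : the stack may copy application data into its own
     *                 buffer and coalesce small writes (needed for small buffers).
     * snd_bytes == 0: the stack transmits directly from the application's
     *                 buffer, which must then stay valid until acknowledged. */
    static int set_send_buffer(SOCKET s, int snd_bytes)
    {
        if (setsockopt(s, SOL_SOCKET, SO_SNDBUF,
                       (const char *)&snd_bytes, sizeof(snd_bytes)) != 0) {
            fprintf(stderr, "setsockopt(SO_SNDBUF) failed: %d\n", WSAGetLastError());
            return -1;
        }
        return 0;
    }

The receive-side socket buffer can be sized the same way with SO_RCVBUF; these are the send and receive buffer settings referred to later in Section III.D.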

In the case of receive, the NIC receives data, updates the receive descriptors supplied by the NIC driver with the checksum and other information, and copies the descriptors and data into memory (using DMA). The NIC will interrupt the processor after the DMA is done. Typically, NICs support a feature where they interrupt the processor either after accumulating a certain number of packets or after a certain time has elapsed since the last interrupt. This feature reduces the number of interrupts to the processor, which are expensive in terms of computing power. This receive path is illustrated in Figure 5. The registered NIC device driver receives the interrupt and passes the buffer containing the data to the TCP/IP stack. The stack reads the TCP and IP header information from the buffer and updates the TCB associated with the socket for which this data was received. Based on the implementation and the receive TCP window size, the stack determines when to send the acknowledgement for the received data. The data is then either copied into the socket receive buffer associated with that connection or placed directly into the application buffer if available.

Figure 5. Details of the TCP/IP Receive Path

III. MEASUREMENT-BASED CHARACTERIZATION

In this section, we describe the methodology and tools used to evaluate TCP/IP packet processing characteristics and performance.

A. Overview of NTttcp

NTttcp is a Microsoft command-line sockets tool based on the TTCP benchmark [26]. These tools are routinely used for measuring TCP and UDP performance between two end systems. When run, this tool outputs the achieved throughput, CPU utilization, number of interrupts generated by the NIC, packet size and other values. The NTttcp transmit side achieves high network throughput by filling an application buffer with data and then repeatedly transmitting this data to a receiving node. Since it is largely running from memory, NTttcp thereby enables a traffic transmitter and a receiver that operate at true network speeds.
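A minimal sockets transmit loop in the spirit of the NTttcp behavior just described might look like the sketch below. It is not NTttcp itself: the socket is assumed to be already connected, the buffer size and iteration count are caller-supplied placeholders, and a real measurement tool would add timing, multiple streams and overlapped I/O.

    #include <winsock2.h>
    #include <string.h>
    #include <stdlib.h>

    /* Repeatedly transmit one pre-filled application buffer, as the NTttcp
     * transmit side does, so the sender runs from memory at network speed. */
    static void blast(SOCKET s, size_t buf_bytes, unsigned long iterations)
    {
        char *buf = (char *)malloc(buf_bytes);
        if (buf == NULL)
            return;
        memset(buf, 'A', buf_bytes);                 /* fill once, reuse for every send */

        for (unsigned long i = 0; i < iterations; i++) {
            size_t off = 0;
            while (off < buf_bytes) {                /* send() may accept less than asked */
                int n = send(s, buf + off, (int)(buf_bytes - off), 0);
                if (n <= 0)
                    goto out;                        /* error or connection closed */
                off += (size_t)n;
            }
        }
    out:
        free(buf);
    }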

NTttcp also provides command line parameters to enable various available stack features, as will be described in a subsequent section on TCP/IP modes of operation. It should be noted that NTttcp only stresses the data transfer portion of TCP/IP connections and does not cover connection management (setups and teardowns). The analysis of the processing requirements of connection management is not within the scope of this paper; we are planning to study this in the future using a traffic generator called GEIST [15].

B. Overview of System Configurations

Our measurement configuration consists of the following systems: (1) the system under test (SUT) and (2) two clients (used as source or sink during receive or transmit experiments). This configuration is illustrated in Figure 6. All three platforms run Microsoft Windows Server 2003 Enterprise Edition (RC2 version). The system under test consists of a Pentium M processor running at 1.6 GHz with a 400 MHz Front Side Bus (FSB), the E7501 chipset with DDR-200 SDRAM memory, two PCI-X slots (133 MHz and 100 MHz) and two dual-ported 1 Gbit Intel PRO NICs. Two 4-way Intel Itanium 2 machines are used as the clients so that they do not become the bottlenecks when receiving or transmitting the data. The SUT is connected to each client over the dual-ported NICs. This gives a total bandwidth of 4 Gbps to the SUT. We measured the network throughput when transmitting or receiving buffer sizes ranging from 64 bytes to 65536 bytes over sixteen NTttcp streams using a single Pentium M processor. For the transmit test, the transmit version of NTttcp was run on the system under test and the receive version was run on the sink servers. For the receive test, the configuration is the exact opposite. It is always ensured that the system is at steady state before any measurements are taken.

Figure 6. Measurement Configurations (Pentium M UP platform connected to two source/sink servers)

C. Performance Monitoring Tools

The NTttcp tool provides the achieved throughput (Mbits/s), the CPU utilization and several other performance data such as packets sent/received, interrupts per second and cycles per byte. For a detailed characterization of the architectural components, we used the in-built performance monitoring events in the Pentium M processor. The event monitoring hardware provides several facilities including simple event counting, time-based sampling, event sampling and branch tracing, which are extracted by an Intel-developed tool called EMON. A detailed explanation of these techniques is not within the scope of this paper. Some of the key processor events that were used in our performance analysis include the number of retired instructions, the unhalted CPU cycles, the number of branches and the number of cache and TLB misses. Each event was sampled for a period of 10 seconds. We made sure that each test lasted long enough to allow data collection of the monitored events.
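The architectural metrics reported later (path length and CPI) are simple ratios of these event counts. The sketch below shows that arithmetic under the assumption that instructions retired, unhalted cycles and the number of buffers transferred are collected over the same sampling interval; the numeric inputs are placeholders, not measured values.

    #include <stdio.h>

    int main(void)
    {
        /* Event counts over one sampling interval (placeholder values). */
        double instructions_retired = 4.2e9;
        double unhalted_cycles      = 6.3e9;
        double buffers_transferred  = 1.0e6;

        double path_length = instructions_retired / buffers_transferred; /* instructions per buffer */
        double cpi         = unhalted_cycles / instructions_retired;     /* cycles per instruction  */
        double cycles_per_buffer = path_length * cpi;

        printf("PL  = %.0f instructions/buffer\n", path_length);
        printf("CPI = %.2f cycles/instruction\n", cpi);
        printf("    = %.0f cycles/buffer\n", cycles_per_buffer);
        return 0;
    }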
D. Measured Modes of TCP/IP Operation

There are several settings available to tune the TCP/IP stack to optimize performance for a given workload. These settings typically fall into two camps: 1) available NIC features and 2) available stack features.

Stack Features:

* Overlapped I/O with completion ports: This is the preferred way of performing network I/O in the Microsoft Windows OS for applications that need to support a large number of connections. In this mode, applications post one or more buffers to send and receive data from the network and wait for completion messages to arrive in the completion port. Completion ports are an efficient mechanism on Microsoft Windows for sending notifications to applications on the status of their I/O operations. The OS locks the memory pages associated with these posted buffers so that they do not get swapped out. This allows TCP/IP to copy the data directly from the application buffer to the NIC using DMA for transmits, and into the application buffer from the NIC buffer for receives, thus avoiding an extra memory copy (a minimal sketch of this mechanism is given after these stack features).

* Send/Receive Buffers: The send and receive buffers dictate the buffer management of the TCP/IP stack. Every TCP/IP stack uses some default values for these parameters, but applications can override these values for each socket. These parameters specify the amount of buffer space to be used by the TCP/IP stack for each socket to temporarily store the data. If the send or receive buffer is set to zero, then the application has to make sure that it provides buffers fast enough to get the maximum throughput.
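The following fragment is a minimal sketch of the overlapped I/O pattern described in the first bullet above, using the standard Winsock completion-port calls (CreateIoCompletionPort, WSARecv, GetQueuedCompletionStatus). Error handling, connection setup and buffer recycling are omitted, and the 64KB posted-buffer size is only an example.

    #include <winsock2.h>
    #include <windows.h>
    #include <string.h>

    #define RECV_BYTES (64 * 1024)   /* example posted-buffer size */

    /* Post one overlapped receive on 's' and wait for its completion on an
     * I/O completion port, as in the overlapped I/O mode described above. */
    static void one_overlapped_receive(SOCKET s)
    {
        static char buf[RECV_BYTES];

        /* Create a completion port and associate the socket with it. */
        HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
        CreateIoCompletionPort((HANDLE)s, iocp, (ULONG_PTR)s, 0);

        /* Post the application buffer; the OS locks its pages so received
         * data can be placed directly into it. */
        WSABUF wsa_buf = { RECV_BYTES, buf };
        WSAOVERLAPPED ov;
        memset(&ov, 0, sizeof(ov));
        DWORD flags = 0;
        WSARecv(s, &wsa_buf, 1, NULL, &flags, &ov, NULL);

        /* Block until the receive completes and is queued to the port. */
        DWORD bytes = 0;
        ULONG_PTR key = 0;
        LPOVERLAPPED done = NULL;
        GetQueuedCompletionStatus(iocp, &bytes, &key, &done, INFINITE);

        CloseHandle(iocp);
    }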

NIC Features:

* Large Segment Offload (LSO): This feature allows the TCP/IP stack to send large segments (greater than the MTU) of data to the NIC and lets the NIC break these large segments into MTU-sized segments and compute the TCP/IP header information for them. The TCP/IP stack computes a single header for the entire segment and passes that to the NIC along with the data. The NIC uses this header information to compute the headers for the smaller segments. Obviously, this feature is helpful only on the transmit side and only for application buffer sizes greater than the MTU.

* Jumbo Frames: This feature allows the NIC to increase its maximum Ethernet frame size from 1514 to a higher value (4088, 9014, etc.). It should be enabled on both ends of the network as well as on all the intermediate nodes for it to work, which is why Jumbo frames are not common on the Internet. But this can be used in autonomous networks, say in an enterprise data center, where it can be enabled between some systems. Applications such as storage and file servers, which typically transfer large packets, can benefit from this feature (a short calculation of the effect of frame size on the number of wire segments per buffer is given after Section IV.A below).

Table 1 summarizes the configuration parameters of our test setup as described earlier.

Table 1. Configuration Parameters for Measurements (NTttcp settings: number of streams (connections) per client; application buffer sizes from 64 bytes to 64K bytes in multiples of 2; overlapped I/O with 2 completion ports; send and receive socket buffer sizes)

IV. ANALYSIS AND CHARACTERIZATION OF TCP/IP PACKET PROCESSING

In this section, we first provide an overview of the TCP/IP performance in terms of the throughput and CPU utilization, as measured by the NTttcp tool. We then characterize the TCP/IP processing by measuring and analyzing the information from various event counters.

Figure 7. TCP/IP Performance Analysis: (a) Transmit Throughput and Utilization; (b) Receive Throughput and Utilization.

A. Overview of TCP/IP Performance

Figure 7(a) shows the observed transmit-side throughput and CPU utilization for both the regular and the Jumbo Ethernet frame sizes. In the case of Jumbo frames, we were able to achieve the 4 Gbps peak line rate for application buffer sizes of 16K bytes and higher. Except for the 64, 4K and 8K byte application buffer sizes, higher throughput is achieved with the same or lower CPU utilization compared to the regular Ethernet frame case. In the case of Jumbo frames, the socket send buffer is set to 64KB for application buffer sizes less than 16KB, so an extra data copy occurs for these. The impact of this copy seems to offset the benefit of Jumbo frames for the 4K and 8K buffer sizes, and hence the lower throughput. This is confirmed by the observed higher path length (instructions/bit) numbers over the regular Ethernet frame case for these three application buffer sizes. Figure 7(b) shows the receive-side throughput and corresponding CPU utilization for regular and Jumbo Ethernet frames. For the most part, the use of Jumbo frames has yielded higher throughput, by up to 22% (at the 2K byte application buffer). Most of this benefit is due to the higher number of bytes received per interrupt.
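The throughput differences between regular and Jumbo frames are easier to see in terms of how many wire segments (and hence per-segment header computations, descriptors and, on receive, interrupts) one application buffer turns into. The sketch below does that arithmetic; the MSS values are nominal figures for 1514-byte and 9014-byte frames assuming standard 40-byte TCP/IP headers, not measured parameters of our setup.

    #include <stdio.h>

    /* Number of TCP segments needed to carry 'buf_bytes' of application data
     * when each segment carries at most 'mss' payload bytes. */
    static unsigned long segments(unsigned long buf_bytes, unsigned long mss)
    {
        return (buf_bytes + mss - 1) / mss;   /* ceiling division */
    }

    int main(void)
    {
        unsigned long buf = 64 * 1024;        /* 64KB application buffer */
        unsigned long mss_regular = 1460;     /* 1514-byte frame: 1500 MTU - 40B TCP/IP headers */
        unsigned long mss_jumbo   = 8960;     /* 9014-byte frame: 9000 MTU - 40B TCP/IP headers */

        printf("64KB buffer -> %lu segments with regular frames\n", segments(buf, mss_regular)); /* 45 */
        printf("64KB buffer -> %lu segments with jumbo frames\n",   segments(buf, mss_jumbo));   /* 8  */
        return 0;
    }

With LSO enabled, the stack hands the NIC the whole buffer and a single template header, so the per-segment header computation moves off the host even though the same number of frames appears on the wire.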

B. Bits per Hertz

We wanted to address the generally held notion of 1 bps per 1 Hz, so we have plotted the bits (transmitted or received) per Hertz in Figure 8. We find that this rule of thumb may only be applicable for transmitting (TX) bulk data over a pre-established connection. We actually observe that when using regular frames and transmitting 32KB buffers, we may be able to achieve up to 5 bps for every Hz. For the receive side (RX), the situation seems dismal: at a 32KB buffer size, we are hardly able to achieve 1 bps for every Hz. For large buffer sizes, the use of Jumbo Ethernet frames further improves the achievable bps/Hz. For example, at a 32KB buffer size, the transmit rate increases from ~5 bps to ~8 bps per Hz. The change for the receive side is not as significant (from 0.8 bps to 1 bps per Hz). For small buffer sizes (TX and RX), we achieve much less than 1 bps/Hz (~0.06 bps/Hz for 64 byte buffers and ~0.3 bps/Hz for 512 byte buffers). Here Jumbo frames do not help, for obvious reasons. Overall, it should be kept in mind that the CPU frequency required for achieving a line rate depends not only on the processor architecture family but also on the bandwidth and latency that the platform provides and on the efficiency of the protocol stack implementation (number of instructions needed in the execution path, etc.). We start to delve into these aspects much deeper in the next few sections. For this we just focus on the regular Ethernet frame case.

C. CPI and Path Length Characteristics

We start presenting the architectural characteristics for TCP/IP data path processing by looking at the number of instructions executed and retired for each application buffer transferred or received. This data (termed path length or PL) was collected using performance counters on the Pentium M processor and is shown in Figure 9. From the figure it is apparent that as the buffer size increases, the number of instructions needed (path length) also increases significantly. A comparison of TX and RX path lengths reveals that the receive path requires significantly more instructions (almost 5X) at large buffer sizes. This explains the larger throughput numbers for transmit over receive. At small buffer sizes, however, the path lengths for TX and RX are quite similar as well as much lower (for example, at 256 byte buffer sizes with regular frames, it requires roughly 4200 instructions to receive a buffer and 4600 instructions to transmit a buffer).

Figure 9. Path Length of TCP/IP Data Transfer

The next step is to understand the number of CPU cycles needed to issue, execute and retire an instruction (denoted as CPI). The CPI numbers for various TCP/IP buffer sizes are shown in Figure 10. The first observation is that the TX CPI is relatively flat compared to the RX CPI, which increases considerably with increase in buffer size. The increase in RX CPI with buffer size is due to the fact that RX processing involves at least one memory copy of the received data, which causes cache misses and makes it more memory dependent. The observed dip in the RX CPI for the 2KB, 4KB and 32KB buffer sizes is due to a reduction in the number of L1 and L2 misses for these buffer sizes. We are further investigating the cause of the lower cache misses for these three buffer sizes.

Figure 10. System-Level CPI
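The bits-per-Hz numbers of Section IV.B and the path length and CPI data above are tied together by a simple identity: the cycles spent per buffer are PL x CPI, and bits per Hz is the buffer size in bits divided by that product. The small calculation below illustrates the relationship; the ~5 bps/Hz input is taken from the 32KB transmit case above, and the implied cycle count is a back-of-envelope figure rather than a separately measured breakdown into PL and CPI.

    #include <stdio.h>

    int main(void)
    {
        /* Identity: bits per Hz = (8 * buffer_bytes) / (PL * CPI),
         * since PL * CPI is the number of CPU cycles spent per buffer. */
        double buffer_bytes = 32.0 * 1024.0;     /* 32KB transmit case from Figure 8 */
        double bits_per_hz  = 5.0;               /* ~5 bps/Hz observed with regular frames */

        double cycles_per_buffer = (8.0 * buffer_bytes) / bits_per_hz;
        printf("Implied cost: ~%.0f cycles per 32KB transmit\n", cycles_per_buffer); /* ~52K cycles */

        /* At the 1.6 GHz clock of the SUT this corresponds to the per-buffer time budget. */
        double seconds = cycles_per_buffer / 1.6e9;
        printf("            = ~%.1f microseconds per buffer at 1.6 GHz\n", seconds * 1e6);
        return 0;
    }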
D. Micro-architectural Characteristics

In addition to CPI and path length, we have also measured the micro-architectural events to better understand the behavior of TCP/IP transmit and receive processing. The micro-architectural events include branch frequency and misprediction, cache misses at various levels in the hierarchy and TLB misses. The detailed data (on a per-buffer basis) is shown in Tables 2 and 3. It is apparent from the tables that the values for each event increase substantially with increase in buffer size. We also note that the percentage of branches (with respect to overall instructions) is roughly 20% irrespective of buffer size and for both transmit and receive processing. When comparing receive-side processing to transmit-side processing, we find that the number of memory references is significantly higher on the receive side. Overall, we also find that as the buffer size increases, the rise in cache misses, TLB misses and branch mispredicts per instruction is more rapid for the receive side and more gradual for the transmit side.

Table 2. Transmit-Side Processing Characteristics

Table 3. Receive-Side Processing Characteristics

Figure 11. TCP/IP (Fast-Path) Packet Processing in Server Workloads

V. IMPACT ON COMMERCIAL SERVER WORKLOADS

In the previous section, we discussed the overall processing requirements of the TCP/IP fast path (data transfers). In this section, we relate these processing requirements to commercial server workloads based on measurements done in our labs. The average path length per operation or transaction for SPECweb99, TPC-W and TPC-C, including its TCP/IP data path component, is illustrated in Figure 11. Measurements and modeling experiments of the SPECweb99 and TPC-W benchmarks in our lab have revealed the overall application path lengths per operation for these workloads, and roughly 1.3 million instructions per transaction (if the entire data set is cached in memory) for TPC-C. In the case of SPECweb99, the receive side per operation only receives an HTTP request, which fits easily within 128 bytes. We have shown earlier in the paper that receiving 128 byte buffers requires around 4000 instructions. The transmit side for SPECweb99 sends an average file of roughly 14KB; to send this, it requires approximately 14,000 instructions. As a result, out of the overall instructions per operation, roughly 18,000 instructions (~28%) are related to packet processing. Similarly for TPC-W, based on the number of messages communicated between the web server and the other systems (clients and database servers), the estimated message sizes and the number of instructions required for processing these by the TCP/IP stack (using NTttcp), we estimated the number of instructions required for the packet processing alone. Of this, 64% (~62,000 instructions) is transmit-side and 36% is receive-side. Also, the majority of the communication (84%) is due to the communication between the clients and the servers. Overall, we find that almost 29% of the instructions executed by TPC-W web servers are related to packet processing. In the case of TPC-C, without any disk I/O accesses, we noted above that the application path length is roughly 1.3 million instructions. Each storage I/O operation (over IP) will transmit or receive a physical page. Assuming an average of 25 storage I/O accesses per transaction, the number of instructions attributed to packet processing works out to just below 700,000 instructions. As a result, the overall path length per transaction (with storage over IP) can be estimated at roughly 2 million instructions, of which the packet processing component is 35%.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we presented a detailed analysis of TCP/IP fast path performance. We accomplished this through detailed measurements on a state-of-the-art low power Intel Pentium M microprocessor. We also used the performance monitoring events on the Pentium M to understand the architectural characteristics of TCP/IP packet processing. We first looked at the achievable performance and associated CPU utilization for transmit and receive processing from 64 byte to 64KB buffer sizes. We then evaluated architectural characteristics by studying CPI, path length, cache performance, TLB performance and branch misprediction rates. We showed that transmit performance is largely compute bound, whereas receive throughput can be memory bound.
We then related the path length required for packet processing to the overall path length required for each transaction (or operation) in commercial server workloads (SPECweb99, TPC-W and TPC-C). We showed that roughly 35% of TPC-C, ~28% of SPECweb99 and ~29% of TPC-W workload execution can constitute TCP/IP packet processing.
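These percentages follow directly from the per-buffer path lengths of Section IV and the per-transaction I/O counts of Section V. The sketch below redoes the TPC-C portion of that arithmetic; the ~28,000 instructions assumed per physical-page transfer is an illustrative figure chosen so that 25 I/Os land near the ~700,000-instruction estimate quoted above, not a separately measured value.

    #include <stdio.h>

    int main(void)
    {
        /* TPC-C estimate from Section V: base transaction path length plus
         * 25 storage I/Os per transaction, each moving one physical page over IP. */
        double base_instructions   = 1.3e6;   /* TPC-C path length with data set cached     */
        double ios_per_transaction = 25.0;
        double instr_per_page_io   = 28000.0; /* assumed per-page TCP/IP cost (~700K / 25)  */

        double packet_instr = ios_per_transaction * instr_per_page_io;   /* ~700,000      */
        double total_instr  = base_instructions + packet_instr;          /* ~2.0 million  */

        printf("Packet processing share: %.0f%%\n", 100.0 * packet_instr / total_instr); /* ~35 percent */
        return 0;
    }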

Future work in this area is multi-fold. Several potential areas of architectural optimization that may help accelerate TCP/IP packet processing need to be investigated; these include network-aware data prefetching and forwarding, protocol stack (software) optimizations and determining the useful architectural features of packet processing engines. We also plan to study the performance characteristics of connection management (set-ups and teardowns). We plan to accomplish this by modifying a web traffic generator that we built called GEIST [15]. Finally, we would also like to accurately quantify the amount of packet processing in several other interesting applications (mid-tier applications and IPC-intensive applications).

ACKNOWLEDGEMENTS

We would like to express our thanks to Michael Espig for providing us the necessary system infrastructure to perform these measurements. We would also like to thank Dave Minturn and Ramesh Illikkal for their insight into TCP/IP protocol stacks, and the other members of our team for their helpful input on this study. Finally, we also thank Raed Kanjo, James Mead and Will Auld for providing us insights into the measurement and analysis of commercial workloads.

NOTICES

Intel, Pentium and Itanium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others.

REFERENCES

[1] Alacritech SLIC: A Data Path TCP Offload Methodology.
[2] J. Chase et al., "End System Optimizations for High-Speed TCP," IEEE Communications, Special Issue on High-Speed TCP, June.
[3] D. Clark et al., "An Analysis of TCP Processing Overhead," IEEE Communications, June.
[4] D. Clark et al., "Architectural Considerations for a New Generation of Protocols," ACM SIGCOMM, Sept.
[5] A. Earls, "TCP Offload Engines Finally Arrive," Storage Magazine, March.
[6] A. Foong et al., "TCP Performance Analysis Re-visited," IEEE International Symposium on Performance Analysis of Software and Systems, March.
[7] S. Gochman et al., "The Intel Pentium M Processor: Microarchitecture and Performance," Intel Technology Journal, May.
[8] "High Performance Network Adapters and Drivers in Windows," wdec/tech/, December.
[9] Y. Hoskote et al., "A 10GHz TCP Offload Accelerator for 10Gb/s Ethernet in 90nm Dual-Vt CMOS," IEEE International Solid-State Circuits Conference (ISSCC), San Francisco.
[10] R. Huggahalli and R. Iyer, "Direct Cache Access for Coherent Network I/O," submitted to an international conference.
[11] iSCSI, IP Storage Working Group, Internet Draft, available at 20.txt.
[12] R. Iyer, "CASPER: Cache Architecture Simulation and Performance Exploration using Refstreams," Intel Design and Test Technology Conference (DITC), July.
[13] R. Iyer, "On Modeling and Analyzing Cache Hierarchies using CASPER," 11th Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, Oct.
[14] K. Kant, "TCP Offload Performance for Front-End Servers," to appear in Globecom, San Francisco.
[15] K. Kant, V. Tewari and R. Iyer, "GEIST: A Generator for E-commerce and Internet Server Traffic," 2001 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Oct 2001.
[16] H. Kim, V. Pai and S. Rixner, "Increasing Web Server Throughput with Network Interface Data Caching," International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), San Jose, CA, October 2002.
[17] D. McConnell, "IP Storage: A Technology Overview," available at ipstorage.htm.
[18] J. Mogul, "TCP Offload Is a Dumb Idea Whose Time Has Come," Symposium on Hot Operating Systems (HotOS), Stanford.
[19] J. Postel, "Transmission Control Protocol," RFC 793, Information Sciences Institute, Sept 1981.
[20] M. Rangarajan et al., "TCP Servers: Offloading TCP/IP Processing in Internet Servers. Design, Implementation, and Performance," Rutgers University, Dept. of Computer Science Technical Report DCS-TR-481, March.
[21] G. Regnier et al., "ETA: Experience with an Intel Xeon Processor as a Packet Processing Engine," Symposium on High Performance Interconnects (Hot Interconnects), Stanford.
[22] RDMA Consortium.
[23] Remote Direct Data Placement Working Group.
[24] P. Sarkar, S. Uttamchandani and K. Voruganti, "Storage over IP: Does Hardware Support Help?" Proc. 2nd USENIX Conference on File and Storage Technologies, CA, March.
[25] SPECweb99 Design Document, available online.
[26] The TTCP Benchmark.
[27] TPC-C Design Document.
[28] TPC-W Design Document.
[29] S. Shankland, "HP to Sharpen Blade with Pentium M."


More information

Connection Handoff Policies for TCP Offload Network Interfaces

Connection Handoff Policies for TCP Offload Network Interfaces Connection Handoff Policies for TCP Offload Network Interfaces Hyong-youb Kim and Scott Rixner Rice University Houston, TX 77005 {hykim, rixner}@rice.edu Abstract This paper presents three policies for

More information

Exchange Server 2007 Performance Comparison of the Dell PowerEdge 2950 and HP Proliant DL385 G2 Servers

Exchange Server 2007 Performance Comparison of the Dell PowerEdge 2950 and HP Proliant DL385 G2 Servers Exchange Server 2007 Performance Comparison of the Dell PowerEdge 2950 and HP Proliant DL385 G2 Servers By Todd Muirhead Dell Enterprise Technology Center Dell Enterprise Technology Center dell.com/techcenter

More information

Benefits of Automatic Data Tiering in OLTP Database Environments with Dell EqualLogic Hybrid Arrays

Benefits of Automatic Data Tiering in OLTP Database Environments with Dell EqualLogic Hybrid Arrays TECHNICAL REPORT: Performance Study Benefits of Automatic Data Tiering in OLTP Database Environments with Dell EqualLogic Hybrid Arrays ABSTRACT The Dell EqualLogic hybrid arrays PS6010XVS and PS6000XVS

More information

Dell Reference Configuration for Large Oracle Database Deployments on Dell EqualLogic Storage

Dell Reference Configuration for Large Oracle Database Deployments on Dell EqualLogic Storage Dell Reference Configuration for Large Oracle Database Deployments on Dell EqualLogic Storage Database Solutions Engineering By Raghunatha M, Ravi Ramappa Dell Product Group October 2009 Executive Summary

More information

The QLogic 8200 Series is the Adapter of Choice for Converged Data Centers

The QLogic 8200 Series is the Adapter of Choice for Converged Data Centers The QLogic 82 Series is the Adapter of QLogic 1GbE Converged Network Adapter Outperforms Alternatives in Dell 12G Servers QLogic 82 Series Converged Network Adapter outperforms the alternative adapter

More information

Daniel A. Menascé, Ph. D. Dept. of Computer Science George Mason University

Daniel A. Menascé, Ph. D. Dept. of Computer Science George Mason University Daniel A. Menascé, Ph. D. Dept. of Computer Science George Mason University menasce@cs.gmu.edu www.cs.gmu.edu/faculty/menasce.html D. Menascé. All Rights Reserved. 1 Benchmark System Under Test (SUT) SPEC

More information

PCI Express x8 Single Port SFP+ 10 Gigabit Server Adapter (Intel 82599ES Based) Single-Port 10 Gigabit SFP+ Ethernet Server Adapters Provide Ultimate

PCI Express x8 Single Port SFP+ 10 Gigabit Server Adapter (Intel 82599ES Based) Single-Port 10 Gigabit SFP+ Ethernet Server Adapters Provide Ultimate NIC-PCIE-1SFP+-PLU PCI Express x8 Single Port SFP+ 10 Gigabit Server Adapter (Intel 82599ES Based) Single-Port 10 Gigabit SFP+ Ethernet Server Adapters Provide Ultimate Flexibility and Scalability in Virtual

More information

The Future of High-Performance Networking (The 5?, 10?, 15? Year Outlook)

The Future of High-Performance Networking (The 5?, 10?, 15? Year Outlook) Workshop on New Visions for Large-Scale Networks: Research & Applications Vienna, VA, USA, March 12-14, 2001 The Future of High-Performance Networking (The 5?, 10?, 15? Year Outlook) Wu-chun Feng feng@lanl.gov

More information

TCP offload engines for high-speed data processing

TCP offload engines for high-speed data processing TCP offload engines for high-speed data processing TCP/IP over ethernet has become the most dominant packet processing protocol. Ethernet networks are now running at higher and higher speeds with the development

More information

Microsoft SQL Server in a VMware Environment on Dell PowerEdge R810 Servers and Dell EqualLogic Storage

Microsoft SQL Server in a VMware Environment on Dell PowerEdge R810 Servers and Dell EqualLogic Storage Microsoft SQL Server in a VMware Environment on Dell PowerEdge R810 Servers and Dell EqualLogic Storage A Dell Technical White Paper Dell Database Engineering Solutions Anthony Fernandez April 2010 THIS

More information

Evaluation of the Chelsio T580-CR iscsi Offload adapter

Evaluation of the Chelsio T580-CR iscsi Offload adapter October 2016 Evaluation of the Chelsio T580-CR iscsi iscsi Offload makes a difference Executive Summary As application processing demands increase and the amount of data continues to grow, getting this

More information

RoCE vs. iwarp Competitive Analysis

RoCE vs. iwarp Competitive Analysis WHITE PAPER February 217 RoCE vs. iwarp Competitive Analysis Executive Summary...1 RoCE s Advantages over iwarp...1 Performance and Benchmark Examples...3 Best Performance for Virtualization...5 Summary...6

More information

Performance Analysis in the Real World of Online Services

Performance Analysis in the Real World of Online Services Performance Analysis in the Real World of Online Services Dileep Bhandarkar, Ph. D. Distinguished Engineer 2009 IEEE International Symposium on Performance Analysis of Systems and Software My Background:

More information

COSC 6385 Computer Architecture - Memory Hierarchies (III)

COSC 6385 Computer Architecture - Memory Hierarchies (III) COSC 6385 Computer Architecture - Memory Hierarchies (III) Edgar Gabriel Spring 2014 Memory Technology Performance metrics Latency problems handled through caches Bandwidth main concern for main memory

More information

Introduction to OpenOnload Building Application Transparency and Protocol Conformance into Application Acceleration Middleware

Introduction to OpenOnload Building Application Transparency and Protocol Conformance into Application Acceleration Middleware White Paper Introduction to OpenOnload Building Application Transparency and Protocol Conformance into Application Acceleration Middleware Steve Pope, PhD Chief Technical Officer Solarflare Communications

More information

NIC TEAMING IEEE 802.3ad

NIC TEAMING IEEE 802.3ad WHITE PAPER NIC TEAMING IEEE 802.3ad NIC Teaming IEEE 802.3ad Summary This tech note describes the NIC (Network Interface Card) teaming capabilities of VMware ESX Server 2 including its benefits, performance

More information

Using Synology SSD Technology to Enhance System Performance Synology Inc.

Using Synology SSD Technology to Enhance System Performance Synology Inc. Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_WP_ 20121112 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges... 3 SSD

More information

Enabling Efficient and Scalable Zero-Trust Security

Enabling Efficient and Scalable Zero-Trust Security WHITE PAPER Enabling Efficient and Scalable Zero-Trust Security FOR CLOUD DATA CENTERS WITH AGILIO SMARTNICS THE NEED FOR ZERO-TRUST SECURITY The rapid evolution of cloud-based data centers to support

More information

Performance Enhancement for IPsec Processing on Multi-Core Systems

Performance Enhancement for IPsec Processing on Multi-Core Systems Performance Enhancement for IPsec Processing on Multi-Core Systems Sandeep Malik Freescale Semiconductor India Pvt. Ltd IDC Noida, India Ravi Malhotra Freescale Semiconductor India Pvt. Ltd IDC Noida,

More information

An FPGA-Based Optical IOH Architecture for Embedded System

An FPGA-Based Optical IOH Architecture for Embedded System An FPGA-Based Optical IOH Architecture for Embedded System Saravana.S Assistant Professor, Bharath University, Chennai 600073, India Abstract Data traffic has tremendously increased and is still increasing

More information

Performance Tuning VTune Performance Analyzer

Performance Tuning VTune Performance Analyzer Performance Tuning VTune Performance Analyzer Paul Petersen, Intel Sept 9, 2005 Copyright 2005 Intel Corporation Performance Tuning Overview Methodology Benchmarking Timing VTune Counter Monitor Call Graph

More information

Linux Network Tuning Guide for AMD EPYC Processor Based Servers

Linux Network Tuning Guide for AMD EPYC Processor Based Servers Linux Network Tuning Guide for AMD EPYC Processor Application Note Publication # 56224 Revision: 1.10 Issue Date: May 2018 Advanced Micro Devices 2018 Advanced Micro Devices, Inc. All rights reserved.

More information

Advanced RDMA-based Admission Control for Modern Data-Centers

Advanced RDMA-based Admission Control for Modern Data-Centers Advanced RDMA-based Admission Control for Modern Data-Centers Ping Lai Sundeep Narravula Karthikeyan Vaidyanathan Dhabaleswar. K. Panda Computer Science & Engineering Department Ohio State University Outline

More information

InfiniBand SDR, DDR, and QDR Technology Guide

InfiniBand SDR, DDR, and QDR Technology Guide White Paper InfiniBand SDR, DDR, and QDR Technology Guide The InfiniBand standard supports single, double, and quadruple data rate that enables an InfiniBand link to transmit more data. This paper discusses

More information

Benefits of I/O Acceleration Technology (I/OAT) in Clusters

Benefits of I/O Acceleration Technology (I/OAT) in Clusters Benefits of I/O Acceleration Technology (I/OAT) in Clusters K. VAIDYANATHAN AND D. K. PANDA Technical Report Ohio State University (OSU-CISRC-2/7-TR13) The 27 IEEE International Symposium on Performance

More information

FlashGrid Software Enables Converged and Hyper-Converged Appliances for Oracle* RAC

FlashGrid Software Enables Converged and Hyper-Converged Appliances for Oracle* RAC white paper FlashGrid Software Intel SSD DC P3700/P3600/P3500 Topic: Hyper-converged Database/Storage FlashGrid Software Enables Converged and Hyper-Converged Appliances for Oracle* RAC Abstract FlashGrid

More information

High-Performance IP Service Node with Layer 4 to 7 Packet Processing Features

High-Performance IP Service Node with Layer 4 to 7 Packet Processing Features UDC 621.395.31:681.3 High-Performance IP Service Node with Layer 4 to 7 Packet Processing Features VTsuneo Katsuyama VAkira Hakata VMasafumi Katoh VAkira Takeyama (Manuscript received February 27, 2001)

More information

ibench: Quantifying Interference in Datacenter Applications

ibench: Quantifying Interference in Datacenter Applications ibench: Quantifying Interference in Datacenter Applications Christina Delimitrou and Christos Kozyrakis Stanford University IISWC September 23 th 2013 Executive Summary Problem: Increasing utilization

More information

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research Fast packet processing in the cloud Dániel Géhberger Ericsson Research Outline Motivation Service chains Hardware related topics, acceleration Virtualization basics Software performance and acceleration

More information

Virtualization, Xen and Denali

Virtualization, Xen and Denali Virtualization, Xen and Denali Susmit Shannigrahi November 9, 2011 Susmit Shannigrahi () Virtualization, Xen and Denali November 9, 2011 1 / 70 Introduction Virtualization is the technology to allow two

More information

QuickSpecs. HPE Ethernet 1Gb Adapters HPE ProLiant DL, ML & Apollo. Overview

QuickSpecs. HPE Ethernet 1Gb Adapters HPE ProLiant DL, ML & Apollo. Overview Overview The HPE Ethernet 1Gb adapters deliver full line-rate performance across all ports with low power consumption, providing Ethernet connectivity ideal for virtualization, security, server consolidation,

More information

Analysis of HTTP Performance

Analysis of HTTP Performance Analysis of HTTP Performance Joe Touch, John Heidemann, and Katia Obraczka USC/Information Sciences Institute June 24, 1996 Initial Release, V1.1 Abstract: We discuss the performance effects of using per-transaction

More information

PERFORMANCE IMPROVEMENT OF AN ISCSI-BASED SECURE STORAGE ACCESS

PERFORMANCE IMPROVEMENT OF AN ISCSI-BASED SECURE STORAGE ACCESS PERFORMANCE IMPROVEMENT OF AN I-BASED SECURE STORAGE ACCESS Kikuko Kamisakay, Saneyasu Yamaguchiz, Masato Oguchiy y Graduate School of Humanities and Sciences Ochanomizu University 2-1-1, Otsuka, Bunkyo-ku,

More information

NOTE: A minimum of 1 gigabyte (1 GB) of server memory is required per each NC510F adapter. HP NC510F PCIe 10 Gigabit Server Adapter

NOTE: A minimum of 1 gigabyte (1 GB) of server memory is required per each NC510F adapter. HP NC510F PCIe 10 Gigabit Server Adapter Overview The NC510F is an eight lane (x8) PCI Express (PCIe) 10 Gigabit Ethernet SR (10GBASE-SR fiber optic) network solution offering the highest bandwidth available in a ProLiant Ethernet adapter. The

More information

SPDK China Summit Ziye Yang. Senior Software Engineer. Network Platforms Group, Intel Corporation

SPDK China Summit Ziye Yang. Senior Software Engineer. Network Platforms Group, Intel Corporation SPDK China Summit 2018 Ziye Yang Senior Software Engineer Network Platforms Group, Intel Corporation Agenda SPDK programming framework Accelerated NVMe-oF via SPDK Conclusion 2 Agenda SPDK programming

More information

Appendix B. Standards-Track TCP Evaluation

Appendix B. Standards-Track TCP Evaluation 215 Appendix B Standards-Track TCP Evaluation In this appendix, I present the results of a study of standards-track TCP error recovery and queue management mechanisms. I consider standards-track TCP error

More information